Maintenance windows are a mistake

yokaze · on Oct 5, 2021

The author clearly has an opinion. More and more I get the feeling that IT is a field where opinions are strongly held, mostly because of past personal bad experiences, and not with much evidence.

There are many ways to approach a problem, and there are good and bad ones, but often you have to make a trade-off. I would appreciate a more differentiated view.

The author does the math on the "high cost of maintenance windows", but leaves out the expenditure side. And what is the rule of thumb for adding another nine to the availability? I vaguely remember an exponential cost increase. "No downtime, ever" requires next to infinite cost.
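For a sense of scale, here is a minimal sketch of the downtime budget each extra nine leaves you. It only shows the arithmetic on the availability side, not the cost side; the function name is made up for illustration and leap years are ignored.

```python
# Downtime budget per year for a given number of nines.
# Assumption: a 365-day year; the helper name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year at `nines` nines (3 -> 99.9% availability)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in range(2, 6):
    print(f"{n} nines: {downtime_minutes_per_year(n):.1f} minutes/year")
```

Each added nine cuts the budget by a factor of ten: three nines leaves roughly 526 minutes a year, five nines roughly 5.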

The author suggests a false dichotomy. You can have maintenance windows while still aiming for zero downtime. The maintenance window serves the customer a higher predictability on when to expect failure. No failures are obviously preferable, but the question is never whether to cause failures or not, but where you spend your resources.

rbarrois · on Oct 5, 2021

Indeed — when you look at this from other engineering disciplines: having your train track built for "no downtime, ever" means that you have a third track available, with its dedicated platforms, so that you can work on one track while traffic goes on.

This might make sense for your inner city loop, with trains passing by every 30 seconds (and it's gonna help when one train inevitably breaks down). However, it's a totally unreasonable cost for a station deep in the country where trains only stop 4 times a day.

zahllos · on Oct 5, 2021

A maintenance window doesn't necessarily have to mean downtime either, it could just mean time you can make changes where people are on hand to fix it if things go wrong.

To extend your analogy, it might be a terrible idea to start doing maintenance on a single line track that sees high usage in a holiday season just before those holidays where everyone wants to get away or get home as do your staff. So you might implement what is usually called a change freeze.

On the other hand you might have a different time where demand is low. Shutting down the line if it comes to that will annoy some people but not as many. So you plan your maintenance then and have people ready in case things go wrong.

drewcoo · on Oct 5, 2021

Adopting metaphorical analogs of all the impediments of real engineers does not make us real engineers.

Software is cheap to produce and quick to change. Building a "third track" on the fly, as needed, is exactly the kind of thing we can do that actual engineers can't. They probably wish they had the ability to do things like that.

spiffytech · on Oct 5, 2021

The author also cherry-picks some examples of planned downtime that are anomalously egregious.

I occasionally get notices of planned downtime from services I use. It's almost never an issue for me - the outage windows are usually relatively brief, infrequent, and off-hours. Not every service has the luxury of even having "off-hours", but simply making planned downtime infrequent goes a long ways.

sysadmindotfail · on Oct 5, 2021

>IT is a field where opinions are strongly held, mostly because of past personal bad experiences, and not with much evidence.

This is spot-on. I see this attitude often when there is a lack of data being exposed for observation and it "just feels" like ____.

obscura · on Oct 5, 2021

As with so many opinions these days, it's at one extreme and ignores the middle ground.

For some apps, the big issue is when you schedule the downtime to happen. If you are truly customer-centric, you should schedule for a time that causes the least disruption for your users. Of course, a decision like this depends on other factors - the number of users likely to be affected, how critical the app changes are, costs, etc.

I've noticed a number of companies in my country taking sites and apps offline in the middle of the workday when they're very likely to be in use. To me, this is unacceptable - the work on the app should be done after hours.

jvvw · on Oct 5, 2021

Surely this is up to the company? They make a decision as to whether to pay somebody to work after-hours versus the business they lose by not doing so.

I specifically took a job that didn't require after-hours work (I have a family which I prioritise) and we did upgrades in working hours. Once in a blue moon a site would be down for a minute or two, but the trade-off was obviously worth it for my public sector employer which generally struggled with recruiting engineers.

dncornholio · on Oct 5, 2021

It's clearly an AWS marketing blog post.

jameshart · on Oct 5, 2021

> The maintenance window serves the customer a higher predictability on when to expect failure.

I don't get this - who wants 'predictable failure'?

If you're a B2C business, this is completely unacceptable - consumers don't plan their google searches around your preannounced downtime windows.

If you're a B2B SaaS business, well... your customers downstream have their own downtime to manage. If they have two vendors, each of which have 'scheduled maintenance windows' that don't overlap, then their combined availability just dropped dramatically.
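A rough sketch of that arithmetic, assuming the customer needs every vendor up at the same time. The function names and the 99% figures are illustrative, not taken from any real SLA.

```python
def combined_availability_disjoint(*availabilities: float) -> float:
    """Downtime windows never overlap, so the vendors' downtimes simply add up."""
    total_downtime = sum(1.0 - a for a in availabilities)
    return max(0.0, 1.0 - total_downtime)


def combined_availability_independent(*availabilities: float) -> float:
    """Outages happen independently, so the availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Two hypothetical vendors at 99% each.
print(combined_availability_disjoint(0.99, 0.99))
print(combined_availability_independent(0.99, 0.99))
```

With non-overlapping windows the downstream customer ends up with twice the downtime of either vendor alone, which is the worst case for a given pair of availabilities.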

Far better to build systems that are generally resilient to being down - queues for offline processing, idempotent operations, retriability... you'll wind up needing them to handle scheduled downtime anyway, and once you have them, you can switch to 'as needed' maintenance with no loss of service.
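As a minimal sketch of what "idempotent operations plus retries" can look like: the service below is a hypothetical in-memory stand-in, and the idempotency key scheme is an assumption for illustration, not any particular vendor's API.

```python
import time


class PaymentService:
    """Hypothetical service that applies each request at most once per key."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # idempotency key -> stored result

    def charge(self, key: str, amount: int) -> int:
        if key in self._results:
            return self._results[key]  # replayed request: no second charge
        self.balance += amount
        self._results[key] = self.balance
        return self.balance


def with_retries(operation, attempts: int = 3, delay_s: float = 0.0):
    """Retry `operation` on ConnectionError with exponential backoff."""
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError as err:
            last_error = err
            time.sleep(delay_s * (2 ** attempt))
    raise last_error


service = PaymentService()
calls = {"n": 0}


def flaky_charge():
    """The first call reaches the service, but its response is lost."""
    calls["n"] += 1
    result = service.charge("order-42", 100)
    if calls["n"] == 1:
        raise ConnectionError("response lost")
    return result


final_balance = with_retries(flaky_charge)
print(final_balance, calls["n"])
```

The client had to call twice, but because the retry reuses the same key the charge lands only once. The same property lets requests queue up during an outage and be replayed afterwards.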

CJefferson · on Oct 5, 2021

Most people don't plan to have downtime, but turns out even the biggest companies in the world can't achieve zero downtime.

Also, some changes are one off and significant (like some database changes), they will hopefully go smoothly, but it is hard to fully prepare for every possible issue.

voidmain · on Oct 5, 2021

I've recently been wondering about the opposite. Maybe the whole Internet should be brought down for a couple of weeks a year, so that society can't become totally dependent on it. Your thermostat shouldn't require a cloud service to work!

dboreham · on Oct 5, 2021

The 2021 Nobel peace prize will be awarded to BGP

mc32 · on Oct 5, 2021

Aside from emergency systems, this is not a bad idea.

People need to voluntarily be disconnected from time to time or experience it due to scheduled down time. Maybe follow some international (or local) holiday schedule.

Life will go on and people will find solutions to the time when things are off.

You should not have to rely on Netflix for entertainment or Facebook for communication, google for information, etc.

People will be better prepared for emergencies this way and won't be left as helpless.

hef19898 · on Oct 5, 2021

One of the most relaxed periods at work, despite a heavy workload, was back when everybody switched from Android phone to iPhones over some security concerns. That switch took almost two weeks, incl. all management functions. So for two weeks, all e-mail and communication had to go through desk phones and your laptop. Back then there was no Teams. And hell was that nice. Nothing was left unattended, everyone was so much more relaxed. And then mobile phones were back again...

pessimizer · on Oct 5, 2021

It's so awful that a network created to withstand nuclear attack has not only evolved to have a bunch of centralized points of failure for itself, but even a bunch of services that serve as points of failure for wider society.

tonyedgecombe · on Oct 5, 2021

A global version of chaos monkey to keep us on our toes.

sumthinprofound · on Oct 5, 2021

A months+ long maintenance window is untenable and I would have returned the thermostat as well. That's one example.

In a smaller IT shop (~8 people) maintaining all services for a 500-person 9-to-5 business, maintenance windows provide a structured way to perform all required tasks without the additional overhead of high availability, especially in cases where it is not necessary.

The caveat is excellent communication regarding the planned maintenance (what, when, why and the anticipated impact) and when systems will be available again.

But I am not a "recognized thought leader in cloud computing" as the author is, I just know what has been practical in my experience.

foepys · on Oct 5, 2021

Maintenance windows are okay.

You and your customer can save a lot of money by doing so, which is important especially when the customer does not post billions in profit every year or swims in VC money.

Maintenance windows are out of working hours most of the time, so it's not hard to not impact business.

sumthinprofound · on Oct 5, 2021

Exactly. Working around the operations needs of the business and communicating availability to set expectations can suffice in some scenarios where five 9s are not required.

yessirwhatever · on Oct 5, 2021

> Planned downtime is still downtime

Isn’t that obvious? Did it really need to be said?

What the article misses is that in most cases where I worked with clients, downtime was not mainly for the customer, but for the internal non-tech team. You need planned downtime to: have a proper recovery plan in case whatever you’re doing doesn’t work out and you gotta revert, or so that you can work comfortably. I had unplanned downtime at different moments day and night and planned downtime during the day or night; planned always wins on so many levels.

If you think planned downtime is not downtime, it’s not the strategy’s fault, it’s your own thinking.

The whole thermostat thing is really not worthy of addressing. Your technical dependency justifies a feeling, but an opinion is a bit much. At best you can call it a half baked thought. Definitely not worth an “opinion piece”

yabones · on Oct 5, 2021

I really don't agree with this hard-line stance on maintenance windows. The use of extreme examples doesn't really strengthen the argument.

It really comes down to the cost of implementing "Super HA" for everything. If that cost is more than the tangible loss of having 30 minutes down each month, it's not worth it.

And it's not just IT stuff that has to be "Super HA", it's everything. Updates need to be atomic. Every database change needs to be able to be half-done so staged rollouts can take place on application servers. That requires 10 to 100 times more effort, and therefore cost. Not to mention testing, certification, etc. It all comes with baggage.
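To illustrate what a "half-done" database change can look like, here is a minimal sketch of the expand/backfill/contract pattern using an in-memory SQLite database. The table and column names are invented for the example, and the pattern is one common approach rather than the only way to do it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
db.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Step 1 (expand): add the new column as nullable, so servers still
# running the old application version keep working unchanged.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# An old app server writes without knowing about the new column.
db.execute("INSERT INTO users (name) VALUES ('Grace Hopper')")

# A new app server writes both columns during the rollout.
db.execute(
    "INSERT INTO users (name, display_name) VALUES (?, ?)",
    ("Alan Turing", "Alan Turing"),
)

# Step 2 (backfill): fill in rows written before or during the rollout.
db.execute("UPDATE users SET display_name = name WHERE display_name IS NULL")

# Step 3 (contract) would drop the old column, but only after every
# server runs the new version. It is deliberately left out here.
missing = db.execute(
    "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
).fetchone()[0]
total = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(missing, total)
```

Even this toy version takes three coordinated steps where a maintenance window would need one `ALTER TABLE`, which is where the extra effort comes from.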

I run a bunch of very small sites on a single Digital Ocean droplet. It has no HA whatsoever, it's just a small VM running nginx. Yet, somehow, it has better uptime than every single one of the 10 biggest sites last year (30 min downtime total for 2020). Sure, I could beat that down to zero by using a load balancer and multiple servers, replication, shared storage, all the shiny stuff, but it's just as likely that something could go wrong because of the added complexity, and take me down for an hour or more.

There's never a single Perfect Solution to every problem. That's why we don't build suspension bridges across creeks.

ho_schi · on Oct 5, 2021

Exceptional examples. But what I miss here is instructions on how to actually do it well.

Instructions on how to

* upgrade server applications quickly without user interruption

* create autonomous applications on the client side

* change data models in the background

Desktop mail clients like Evolution or "git --everything-is-local" are good examples. They work autonomously and can sync with servers anytime. The author's examples are awkward exceptions. Real-world ones are available, for example Garmin Connect or Google Mail. Garmin Connect cannot do anything autonomously, and last year we saw an outage lasting a complete week. The same goes for Google Mail: it cannot download all mails in a mailbox. On a mobile device with 128 GB of disk space. You see the irony? A mobile device that will often suffer connection loss, with a big disk? Let's look at K-9 Mail or Evolution: Download all? YES! Sync when possible? YES!

That is my recommendation. Make your local application autonomous and use servers to sync when possible. Of course that costs development time. And it should: reliability costs. Making a server un-failable isn't possible.

By the way, I'm interested in quick and better migration best practices.

_ph_ · on Oct 5, 2021

The trick with planned downtime is to carefully plan it :). There are a lot of services, which don't have to be highly available 24/7 (of course there are also some which do). Planned downtime means to put the downtime into a time slot, at which there is the least customer impact. For many services, it is possible for the customer to adjust usage to it easily. This doesn't mean, it is generally ok to be down, just because it was announced in advance :)

oauea · on Oct 5, 2021

Almost no one cares enough to pay for high availability and more expensive engineers who can service this. Until that changes, maintenance windows are here to stay.

xyzzy_plugh · on Oct 5, 2021

The problem is that actual high availability becomes too expensive for mere mortals. Even big dogs like Google, Netflix know there is a trade-off that can be worth it financially. The difference between five 9s, six 9s and seven 9s and beyond can be an order of magnitude more expensive at scale.

Maintenance windows suck, but they work, and are predictable. It's easier to upgrade a database by bouncing it with a bit of predictable downtime, much much harder and expensive to do it live with zero downtime.

Edit: not to mention that humans understand and are usually incredibly forgiving. But they are rarely forgiving when surprised with outages.

cjfd · on Oct 5, 2021

I don't think we can conclude something about maintenance windows in general from these examples. Instead we can draw the conclusion that unreasonably long maintenance windows are a mistake. Not that anyone actually needs an article to reach that extremely brilliant conclusion....

okeuro49 · on Oct 5, 2021

Maintenance windows make a lot of sense when your market is domestic, as you can run upgrades during a night shift at the weekend.

Although in the company I used to work for, I can remember there were stories of people who liked to carry out some customer journeys at 3am.

high_5 · on Oct 5, 2021

Maintenance windows are not a bug, they are a feature. Users should learn that there is no such thing as 100% uptime, so everyone should have a contingency plan for downtime. As many just learned the hard way with the FB/IG/WA fallout yesterday.

mortehu · on Oct 5, 2021

This kind of thinking hasn't made its way to iPhones. When it's auto-updating an app I'm not allowed to open it (separately, it seems optimized to start updating an app moments before I'm about to open it).

mschuster91 · on Oct 5, 2021

That's the case with all mainstream operating systems. You can't go and change a program's binary data while it is running without blowing stuff up. As a result, Windows locks executable files for write/delete upon startup and releases the lock upon exit of the last usage, and unixoid OSes warn you to restart applications after an upgrade.

The only environment other than maybe mainframes that supports hot reloading is Java and that has limitations.

Denvercoder9 · on Oct 5, 2021

> You can't go and change a program's binary data while it is running without blowing stuff up.

Linux has zero problems with you overwriting a binary (by unlinking and replacing it) while it's in use. Sure, the old version will keep running, but it doesn't prevent the update, and you'll get the new version when you restart it.

Someone · on Oct 5, 2021

If you have to update multiple files (say the main executable and a few shared libraries that it may only load when it needs them), you’ll need a transactional file system.

Even then there’s still downtime when you restart the executable (could be seconds, but also minutes). Preventing that is possible, too (even on a single machine), but the cost often doesn’t make it worth it.

And nitpick: I think you know it, but for others: “unlinking and replacing” is essential. “Overwriting” (keeping the inode number the same) can be problematic.
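A minimal sketch of the "unlink and replace" approach on a POSIX system: write the new version to a temporary file in the same directory, then rename it over the old path. The helper name and file names are made up for the example, and an open file handle stands in for a running process.

```python
import os
import tempfile


def replace_atomically(path: str, new_content: bytes) -> None:
    """Swap in a new file at `path` without touching the old inode."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_content)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # rename over the old name
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "app.bin")
with open(target, "wb") as f:
    f.write(b"version 1")

running = open(target, "rb")  # stands in for a process using the old file
old_inode = os.stat(target).st_ino

replace_atomically(target, b"version 2")
new_inode = os.stat(target).st_ino

old_view = running.read()  # the open handle still sees the old inode
running.close()
with open(target, "rb") as f:
    new_view = f.read()  # anything opening the path now gets the new file

print(old_view, new_view, old_inode != new_inode)
```

The path now points at a new inode while the old one stays alive for whoever still has it open. Opening the file with `"wb"` and writing in place would instead truncate the very inode the running reader is using. This covers a single file only; it does nothing for the multi-file case mentioned above.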

teddyh · on Oct 5, 2021

> “unlinking and replacing” is essential. “Overwriting” (keeping the inode number the same) can be problematic.

Yes, that is why /usr/bin/install exists.

krageon · on Oct 5, 2021

> That's the case with all mainstream operating systems

No, in Android you can keep using the app right until the actual swap happens. Then the app restarts (and presumably a symlink is put in place in between). In practical terms this is zero downtime.

dnet · on Oct 5, 2021

Erlang supports hot reloading by design with no limitations. There can even be some threads using the old and some using the new version simultaneously. It was designed for phone exchanges where they aimed for 9 nines of availability. You can install it on most mainstream operating systems.

bregma · on Oct 5, 2021

> That's the case with all mainstream operating systems.

That's the case with a mainstream operating system. Only one. All the others can count uptime in years even with regular application updates.

scheme271 · on Oct 5, 2021

It depends. Linux requires reboots if you update the kernel (kexec and kpatch are still experimental) and you should probably reboot if you update libc or other similar system software. I don't think any mainstream OS allows kernels or system software to be updated on the fly and without a reboot.

mschuster91 · on Oct 5, 2021

> All the others can count uptime in years even with regular application updates.

Every major Debian/Ubuntu upgrade asks me to restart services due to a libc upgrade and warns of potential instability as a result of not doing a restart. Not to mention upgrading systemd, which tends to require a reboot.

rob74 · on Oct 5, 2021

Ok, users are dissatisfied with maintenance windows, I can understand that. But I'd bet they'd be even more dissatisfied if their payment would be booked incorrectly because of a deployment glitch while they were ordering something. So in some cases maintenance windows are still the safest bet...

mikesabbagh · on Oct 5, 2021

I was asked to add a maintenance page to a critical product when updating the system! Adding a maintenance page is a type of additional induced downtime!! What you need to do is just do the needed update, and any downtime comes out of your downtime credit. Maybe inform users that not all services will be up and site performance may be sub-optimal. Best if you can simulate the updates on a test environment beforehand!

Waterluvian · on Oct 5, 2021

Tuesday Maintenance always sucked but at least I knew not to expect to be able to play during that time.

rini17 · on Oct 5, 2021

Or just stop insisting you must be able to control thermostat remotely.

I'm pretty sure if I had smart stuff at home, I'd be tempted to check it all the time. Better avoid that.

Semaphor · on Oct 5, 2021

I generally call any IoT tool that depends on a third-party-service dumb, as nothing you have is smart, the smarts are at the vendor.

ajsnigrutin · on Oct 5, 2021

Homeassistant has gone a long way to make stuff smart, and the maintenance windows are usually when you are at home, on your PC, doing the maintenance yourself :)

Semaphor · on Oct 5, 2021

I use HA as well. But there are integrations that just wrap/connect a vendor API, which is a no-go for me.

Maintenance window is the morning during coffee, before my wife wakes up :D

ajsnigrutin · on Oct 6, 2021

yeah.. it takes some googling, but a few devices (eg. sonoff plugs) can be flashed with tasmota, and some (mostly zigbee) can be used directly with a usb dongle and zigbee2mqtt, without any vendor-based cloud service (eg. tuya valves for radiators, ikea bulbs, sensors,...). But yeah, either way, you have to do research before you buy the device :)

rcthompson · on Oct 5, 2021

If you're tempted to check it all the time, that really means it's not "smart" enough for you to actually trust it (which is probably true of nearly all "smart" tech for today).

rob74 · on Oct 5, 2021

The dumbest thing about "smart" thermostats are the batteries. My previous company had some in the office, and after a few times of coming to the office on Monday and being confronted with tropical temperatures because the batteries in one of the thermostats had gone empty over the weekend and it had defaulted to "open", I banished any thought of getting these at home...