You probably don't need zero downtime

Today, Knock published an article on how they upgraded their RDS instances with zero downtime.

They use an Elixir tech stack, which has a special place in my heart.

One thing I want to comment on: when doing upgrades and migrations like this, consider whether you actually need zero downtime.

Companies often spend a lot of engineering time making upgrades and migrations zero downtime without considering the return on investment.

There’s a high chance that whatever you’re working on doesn’t need to be up 100% of the time. We’re not all working on NASA Mars Rovers.
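To put a number on that, here is a minimal sketch that turns an availability target into a downtime budget. The function name `downtime_budget_minutes` and the 30-day period are my own assumptions for illustration, not something from Knock's article.

```python
# Hypothetical helper: convert an availability target into a downtime budget.
# The 30-day default period is an assumption; use whatever window your SLA defines.
def downtime_budget_minutes(availability_percent: float, period_days: int = 30) -> float:
    """Return the minutes of downtime allowed per period for a given target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The budget is whatever fraction of the period is allowed to be "down".
    return total_minutes * (100 - availability_percent) / 100


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 100.0):
        budget = downtime_budget_minutes(target)
        print(f"{target}% uptime -> {budget:.1f} minutes of downtime per 30 days")
```

A 99.9% target leaves roughly 43 minutes of downtime per 30 days. Only a 100% target leaves none, and that is the target zero-downtime work implicitly signs you up for.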

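If you think about it, incidents happen all the time, and they cause downtime. When I worked at Venmo, we had occasional outages. If customers already live with unplanned downtime, a short, announced maintenance window is rarely a dealbreaker.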

You can drastically reduce your engineering burden by setting clear expectations with your customers early.

Valve Software does maintenance on Steam every Tuesday night. Gamers generally know about it and accept the brief outages. Valve is also one of the most profitable companies per employee.

When planning infrastructure changes, don’t assume they must be zero downtime.

Most of the time it’s not worth the investment.

