Dealing with Large Scale Changes (LSCs)

Teams trying to adhere to Google’s way of doing SRE often run into problems making large scale changes.

There’s a tendency to want to have each team own and implement a particular change that has to be implemented.

For example, say your company needs to upgrade a core dependency used by almost all the microservices. Your team owns this common library and is responsible for implementing the changes. You might try to divide the work amongst each team that owns each service. Then, all your team does is the project management, nagging everybody to do their work of upgrading.

In my experience, this goes poorly almost every single time.

The following excerpt from chapter 22 of “Software Engineering at Google” by Titus Winters, Tom Manshreck, and Hyrum Wright describes exactly why it goes poorly:

As just indicated, the infrastructure teams that build and manage our systems are responsible for much of the work of performing LSCs, but the tools and resources are available across the company. If you skipped Chapter 1, you might wonder why infrastructure teams are the ones responsible for this work. Why can’t we just introduce a new class, function, or system and dictate that everybody who uses the old one move to the updated analogue? Although this might seem easier in practice, it turns out not to scale very well for several reasons.

First, the infrastructure teams that build and manage the underlying systems are also the ones with the domain knowledge required to fix the hundreds of thousands of references to them. Teams that consume the infrastructure are unlikely to have the context for handling many of these migrations, and it is globally inefficient to expect them to each relearn expertise that infrastructure teams already have. Centralization also allows for faster recovery when faced with errors because errors generally fall into a small set of categories, and the team running the migration can have a playbook—formal or informal—for addressing them.

Consider the amount of time it takes to do the first of a series of semi-mechanical changes that you don’t understand. You probably spend some time reading about the motivation and nature of the change, find an easy example, try to follow the provided suggestions, and then try to apply that to your local code. Repeating this for every team in an organization greatly increases the overall cost of execution. By making only a few centralized teams responsible for LSCs, Google both internalizes those costs and drives them down by making it possible for the change to happen more efficiently.

Second, nobody likes unfunded mandates. Even though a new system might be categorically better than the one it replaces, those benefits are often diffused across an organization and thus unlikely to matter enough for individual teams to want to update on their own initiative. If the new system is important enough to migrate to, the costs of migration will be borne somewhere in the organization. Centralizing the migration and accounting for its costs is almost always faster and cheaper than depending on individual teams to organically migrate.

Additionally, having teams that own the systems requiring LSCs helps align incentives to ensure the change gets done. In our experience, organic migrations are unlikely to fully succeed, in part because engineers tend to use existing code as examples when writing new code. Having a team that has a vested interest in removing the old system responsible for the migration effort helps ensure that it actually gets done. Although funding and staffing a team to run these kinds of migrations can seem like an additional cost, it is actually just internalizing the externalities that an unfunded mandate creates, with the additional benefits of economies of scale.

You can read the whole chapter and book here for free.


Join the 80/20 DevOps Newsletter

If you're an engineering leader or developer, you should subscribe to my 80/20 DevOps Newsletter. Give me 1 minute of your day, and I'll teach you essential DevOps skills. I cover topics like Kubernetes, AWS, Infrastructure as Code, and more.

Not sure yet? Check out the archive.

Unsubscribe at any time.