Five Whys Outside of Incident Post-Mortems
Running a root cause analysis, such as Five Whys, after a production outage is good practice.
If you’re unfamiliar with Five Whys, here’s an example from Wikipedia:
An example of a problem is: the vehicle will not start.
Why? – The battery is dead.
Why? – The alternator is not functioning.
Why? – The alternator belt has broken.
Why? – The alternator belt was well beyond its useful service life and not replaced.
Why? – The vehicle was not maintained according to the recommended service schedule. (A root cause)
Engineers at tech companies are familiar with this process in incident post-mortems.
However, most engineers don’t reach for root cause analysis outside of incidents.
I’ve found that using Five Whys in everyday engineering work has helped me uncover underlying architectural and infrastructure problems before they become incidents.