Monitor symptoms, not causes
During my time at Venmo, my team was in charge of running large Jenkins and GitHub Actions CI clusters. We had a lot of monitoring for issues with our CI system, but it always felt like we were **reactive and not proactive**.
If our Jenkins machines ran out of disk space, we’d have an incident, then add monitoring for the disk space issues.
If our Jenkins machines ran out of file descriptors, we’d have an incident, then add monitoring for file descriptors.
The better way to do this is to monitor for problems from your users’ perspective instead.
I recently came across Rob Ewaschuk’s paper on monitoring which explains the concept better than I could. You should monitor symptoms, not causes.
Here’s an excerpt:
I call this “symptom-based monitoring,” in contrast to “cause-based monitoring”. Do your users care if your MySQL servers are down? No, they care if their queries are failing. (Perhaps you’re cringing already, in love with your Nagios rules for MySQL servers? Your users don’t even know your MySQL servers exist!) Do your users care if a support (i.e. non-serving-path) binary is in a restart-loop? No, they care if their features are failing. Do they care if your data push is failing? No, they care about whether their results are fresh.
In the new world with symptom-based monitoring, we’d simply run our pipelines and monitor if they did what they were supposed to. If there was an issue with that, we’d be alerted. This way, regardless of the underlying cause of the issue, we’d catch it proactively.
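To make that concrete, here is a minimal sketch of what such a check could look like. It is not what we ran at Venmo; the names `CanaryResult`, `check_canary`, and `run_pipeline`, as well as the 10-minute limit, are all hypothetical. The idea is that a scheduled job triggers a small known-good pipeline the same way a developer would, and alerts only on what a user would notice: the pipeline did not run, failed, or took too long.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CanaryResult:
    """Outcome of one canary pipeline run, as a user would experience it."""
    succeeded: bool          # did the pipeline finish and do what it should?
    duration_seconds: float  # how long a user would have waited


def check_canary(
    run_pipeline: Callable[[], CanaryResult],
    max_duration_seconds: float = 600.0,  # assumed limit, tune for your pipelines
) -> Optional[str]:
    """Return an alert message if a user-visible symptom is present, else None.

    `run_pipeline` is a hypothetical callable that triggers a known-good
    pipeline on the CI system and waits for the result.
    """
    try:
        result = run_pipeline()
    except Exception as exc:
        # Disk full, no file descriptors, network down: the cause does not
        # matter here, only that the pipeline could not run.
        return f"canary pipeline could not run: {exc}"

    if not result.succeeded:
        return "canary pipeline failed"

    if result.duration_seconds > max_duration_seconds:
        return (
            f"canary pipeline took {result.duration_seconds:.0f}s "
            f"(limit {max_duration_seconds:.0f}s)"
        )

    return None
```

Note that nothing in the check knows about disk space or file descriptors. Either failure would surface as a pipeline that could not run or did not succeed, and the cause-level metrics then become something you look at while debugging rather than something that pages you.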