Explaining Why Noisy Alerting is Dangerous
Personal experience enriched by Google’s SRE Book / Dan Luu’s Post-Mortem Research
The aim of this page📝 is to connect the assertion with experience. As of writing this, I am cleaning up after, and reflecting on, a data delivery outage my team caused: a very noisy pipeline had been firing latency alerts because of a suboptimal data warehouse configuration. We got accustomed to the latency alerts, we didn't have the alerting granularity to silence just that noise, and silencing more broadly felt risky anyway (what if…). And then a real incident happened and we missed it. Pretty lame. Let's see what the literature says.
TL;DR >>> Config bugs cause roughly half of all major outages. A major contributor to config bugs is a noisy work environment. Start there. Make alerts matter → reduce config outages.
Configuration bugs, not code bugs, are the most common cause I’ve seen of really bad outages. When I looked at publicly available postmortems, about 50% of the outages I found by searching for “global outage postmortem” were caused by configuration changes. Publicly available postmortems aren’t a representative sample of all outages, but a random sampling of postmortem databases also reveals that config changes are responsible for a disproportionate fraction of extremely bad outages. As with error handling, I’m often told that it’s obvious that config changes are scary, but it’s not so obvious that most companies test and stage config changes like they do code changes.
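One way to "test config changes like code changes" is to gate them behind a validator in CI, so a bad change fails a build instead of paging someone at 3 a.m. A minimal sketch in Python; the key names (`warehouse`, `max_query_latency_ms`, `retry_limit`) and the threshold are illustrative assumptions, not from any real system:

```python
# Sketch: validate a config change before it ships, the same way a
# code change would have to pass tests. All key names and bounds here
# are hypothetical examples.
import json

REQUIRED_KEYS = {"warehouse", "max_query_latency_ms", "retry_limit"}

def validate_config(raw: str) -> dict:
    """Parse and sanity-check a config blob; raise before it deploys."""
    cfg = json.loads(raw)  # fail fast on malformed syntax
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    # Reject values that are syntactically valid but operationally absurd.
    if not 0 < cfg["max_query_latency_ms"] <= 60_000:
        raise ValueError("max_query_latency_ms out of sane range")
    return cfg

# A change like this passes review; one missing a key fails the build
# instead of causing an outage.
validate_config(
    '{"warehouse": "dw1", "max_query_latency_ms": 500, "retry_limit": 3}'
)
```

The point is not the specific checks but the staging discipline: the config goes through the same reject-early pipeline as code.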
Paging a human is a quite expensive use of an employee’s time. If an employee is at work, a page interrupts their workflow. If the employee is at home, a page interrupts their personal time, and perhaps even their sleep. When pages occur too frequently, employees second-guess, skim, or even ignore incoming alerts, sometimes even ignoring a “real” page that’s masked by the noise. Outages can be prolonged because other noise interferes with a rapid diagnosis and fix. Effective alerting systems have good signal and very low noise.
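The SRE guidance above boils down to a routing rule: a page must be both urgent and actionable, and everything else belongs in a ticket queue rather than on a pager. A minimal sketch, assuming a two-level severity label decided at rule-authoring time (the field names and alert names are illustrative):

```python
# Sketch: route only urgent, actionable alerts to a human's pager.
# Severity labels and alert names are hypothetical examples.
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str      # "page" or "ticket", decided when the rule is written
    actionable: bool   # is there a concrete response a human can take?

def route(alert: Alert) -> str:
    # A page must be both urgent and actionable; anything else is noise
    # that trains responders to skim or ignore the pager.
    if alert.severity == "page" and alert.actionable:
        return "pager"
    return "ticket_queue"

# The noisy latency alert from the outage story would be demoted to a
# ticket, so a real page still stands out:
route(Alert("pipeline_latency_p50_high", "ticket", True))   # -> "ticket_queue"
route(Alert("data_delivery_stalled", "page", True))         # -> "pager"
```

Pushing the page-or-ticket decision into the alert definition, instead of the responder's head, is what keeps the pager's signal-to-noise ratio high.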