Showing posts from July, 2014

Alerts should treat your Ops team like the police... only contact them in real emergencies!

Every time an alert notifies your Ops team, there should be a real problem.  If there isn't a real problem, you're wasting their time, adding confusion and making it harder for them to respond to real incidents.

The problem we have: Too many alerts for non-issues and non-critical issues. I sat with a member of our Ops team recently and was horrified to see how many notifications they received from our monitoring systems. At times, it was almost impossible to make any sense of them due to the sheer volume of emails filling their inbox.
Why is this such a problem? Multiple reasons:

Each notification comes at the cost of lessening the impact of all other notifications. Taken to the extreme, where Ops receive hundreds per day, the impact of any single alert can be almost zero.

A false alarm is a distraction and can waste valuable minutes that should be spent debugging real issues.

How did we get here? The short answer is: by diligently adding more alerts but not diligently reviewing their behaviour. …