Handling alerts is a very important part of the job for many developers and DevOps engineers. An even more important, and often overlooked, aspect is alert fatigue. Alert fatigue is caused by a high volume of alerts, many of which are false positives, related alerts, or duplicates. The people handling them become so used to alerts that they start to disregard them and miss the important ones. It takes only one or two missed alerts to bring a system to a halt. However, alert fatigue is nothing but a symptom of underlying issues that need to be addressed.
In this article we discuss the factors that can lead to alert fatigue, how to identify them, and how to handle them. In addition, we will introduce Prometheus Alertmanager and show how to use it to handle the very scenarios that lead to alert fatigue.
Before diving in, there are a few prerequisites if you want to follow along with the article:
Developers and DevOps engineers are busy people. When a significant part of their time is spent looking into or muting alerts that don’t matter, we are looking at alert fatigue. To handle this situation we need to ask ourselves why there are so many alerts, and why so many of them are false positives, duplicates, or simply of no value.
There’s a tendency to sometimes overdo alert configuration, mainly as a precaution. For example, we would like to know if a service is using a lot of memory or if its CPU is spiking so that we can react in time in case it develops into a service disruption. In an attempt to have a preemptive, encompassing view of the system, we sometimes configure too many alerts whose thresholds are unjustified. To justify an alert, we need to ask whether the CPU or memory spike is really a concern. It’s not uncommon for services to work a little harder at times. If the historical data shows spikes that resolve themselves and don’t correspond to incidents, the alert should not be configured at all. Besides, if we are concerned about service disruption due to resource shortage, we should consider automatic scaling, not setting more alerts.
Not setting more alerts is one part of prevention being the best medicine. Dealing with the ones that are already set is the other. Alerts that are already set should be examined through historical and operational lenses, for example:

- Has the alert ever corresponded to a real incident, or does it fire and resolve on its own?
- When it fires, does anyone actually act on it, or is it routinely muted?
- Is it a duplicate of, or always accompanied by, another alert for the same underlying cause?

These questions will help us discover alerts that can be removed to prevent alert fatigue.
However, for mature and perhaps more complicated systems with many services and moving parts, going over hundreds of configured alerts and determining whether they are important or redundant can prove quite difficult and time consuming. There’s always the question of ROI when it comes to cleaning up alerts. If we spend a sprint on cleaning up alerts it might benefit us down the line by reducing alert fatigue, but we just missed a sprint where we could have delivered features and bug fixes. So the trade-off is always there. We’d still argue that alert cleanup should take place, even if in small increments. While we are cleaning them up bit by bit, we can introduce an alert manager whose purpose is to provide additional protection against alert fatigue.
An alert manager like Prometheus Alertmanager provides a robust way to manage alerts. In the context of handling alert fatigue, the most significant aspect of an alert manager is its ability to group and inhibit alerts.
Let’s look at an example of correlating and grouping alerts. Imagine a scenario in which a datastore such as a database, search engine, or queue manager is reporting high CPU and memory consumption. Services that depend on this datastore might experience difficulties communicating with it, and they too might trigger alerts indicating that they cannot reach the datastore. With Prometheus Alertmanager, we can configure routing so that, if there’s an alert for the datastore’s resource consumption, all related alerts are grouped together and sent to the same receiver. This way we can see both the underlying cause and which services are affected.
Given that alerts are properly labeled and configured to include the service name, team, cluster, region, and any other attribute of significance, we can configure the routing as follows:
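A minimal sketch of such a routing tree is shown below; the receiver names and the team, cluster, and database labels are illustrative and assume the alerts carry these labels:

```yaml
# Illustrative Alertmanager routing sketch; receiver names and labels are examples.
route:
  receiver: default-receiver
  routes:
    # Route anything labeled for the data-dev team to its own receiver,
    # grouping notifications by cluster and database.
    - matchers:
        - team = "data-dev"
      receiver: data-dev-receiver
      group_by: ['cluster', 'database']
      group_wait: 30s
      group_interval: 5m

receivers:
  - name: default-receiver
  - name: data-dev-receiver
    # Replace with the team's real notification integration (Slack, email, PagerDuty, etc.).
    webhook_configs:
      - url: "http://example.com/data-dev-hook"
```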
In the routes directive, we are saying that any alert meant for the data-dev team should be grouped under the data-dev-receiver by cluster and database. To simulate, we can run these commands:
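Assuming an Alertmanager instance listening on localhost:9093, one way to simulate this is to post two test alerts that share the same cluster and database labels directly to its API (the alert names and label values are illustrative):

```bash
# Fire the underlying datastore alert (illustrative names and label values).
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "DatabaseResourceConsumptionHigh",
                   "team": "data-dev", "cluster": "prod-1",
                   "database": "orders-db", "severity": "critical"}}]'

# Fire a dependent-service alert with the same cluster/database labels
# so it is grouped into the same notification.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ServiceDatastoreUnreachable",
                   "team": "data-dev", "cluster": "prod-1",
                   "database": "orders-db", "service": "checkout",
                   "severity": "warning"}}]'
```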
This is what it would look like in Alertmanager:
In other cases we want to suppress related alerts rather than group them with the underlying alert. For example, a database has alerts configured for high resource consumption (DatabaseResourceConsumptionHigh), slow queries (DatabaseSlowQueries), failed queries (DatabaseFailedQueries), and the database being unreachable.
Any of these alerts can be followed by the others. If an alert triggers for resource consumption, then an unreachable alert might also trigger. If queries take too long to execute, an alert for resource consumption might trigger as well.
So any alert might be followed by other alerts, which creates a cascade of incoming alerts when all that’s needed is just the one. Instead of having them all trigger one after another, Prometheus Alertmanager can be configured to suppress a subset of alerts while a specific alert is active. To handle this situation, we can create the following inhibit rules:
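A sketch of such inhibit rules, scoping the inhibition to the same database via the equal clause, could look like this (the alert names match the ones discussed below):

```yaml
# Illustrative inhibit rules; alert names are the ones used in this example.
inhibit_rules:
  # While the resource consumption alert is firing for a database,
  # mute the query-related alerts for that same database.
  - source_matchers:
      - alertname = "DatabaseResourceConsumptionHigh"
    target_matchers:
      - alertname =~ "DatabaseSlowQueries|DatabaseFailedQueries"
    equal: ['database']
```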
We’re saying that if there’s an active DatabaseResourceConsumptionHigh alert for a specific database, any DatabaseSlowQueries or DatabaseFailedQueries alert for that same database will be inhibited. Inhibited means that the alert will not be sent out, but will still be visible on demand. The idea here is that if there’s an alert on resource consumption, we already know something is going on with the database. We don’t need to be paged for all the other issues, as they are probably related, yet we still have them at hand for investigation.
This is what it would look like in Alertmanager:
Inhibited alerts don’t appear when Inhibited is not selected
Inhibited alerts appear when Inhibited is selected
Individual contributors can also use Prometheus Alertmanager’s ability to suppress alerts to help prevent alert fatigue.
An engineer who is making changes to the system and knows that alerts might trigger should use Alertmanager’s ability to silence and inhibit alerts. It goes without saying how disruptive it is to get a high number of alerts in the middle of a workday, especially when these alerts are false positives and could have been silenced in advance.
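For example, silences can be created ahead of time from the command line with amtool. Assuming Alertmanager runs on localhost:9093, something along these lines would silence the database alerts for a two-hour maintenance window (the author, comment, and matcher values are illustrative):

```bash
# Silence all Database* alerts for orders-db for two hours (illustrative values).
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="jane.doe" \
  --comment="Planned schema migration on orders-db" \
  --duration=2h \
  alertname=~"Database.*" database=orders-db
```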
If the alerts cannot be silenced because they are required as indicators during the operational change, the engineer should take over the on-call (PagerDuty or developer-on-duty) shift for its duration. It might sound trivial, but from personal experience we can testify that this is a common problem.
Using an alert manager and personally assuming responsibility for alerts are the two determining factors in handling alert fatigue. Together they will greatly reduce the number of alerts and pages, mitigating and preventing alert fatigue. However, these are but tools and methods for achieving the goal; they have to be applied continuously and revisited over time.
It’s important to remember that the effort to manage alerts and avoid alert fatigue is a never-ending process. As systems evolve, their monitoring stack evolves as well. Alert handling rules configured in the past might contain oversights, or rest on assumptions that no longer hold, and should be revisited. Think of it as a kind of monitoring for your monitoring: we have alerts and alerting rules in place, but we always have to ask whether they are properly set and whether conditions have changed enough to merit revising them.