Handling alerts is a very important part of the job for many developers and DevOps engineers. An even more important, and often overlooked, aspect is alert fatigue. Alert fatigue is caused by a high volume of alerts, many of which are false positives, related alerts, or duplicates. The people handling them become so used to alerts that they start to disregard them and miss the important ones. It takes only one or two missed alerts to bring a system to a halt. However, alert fatigue is nothing but a symptom of underlying issues that need to be addressed.
In this article we discuss the factors that can lead to alert fatigue, how to identify them, and how to handle them. In addition, we will introduce Prometheus Alertmanager and show how to use it to handle the very scenarios that lead to alert fatigue.
Before diving in, there are a few prerequisites if you want to follow along with the article:
Developers and DevOps engineers are busy people. When a significant part of their time is spent looking into or muting alerts that don’t matter, we are looking at alert fatigue. To handle this situation we need to ask ourselves why there are so many alerts, and why so many of them are false positives, duplicates, or simply of no value.
There’s a tendency to sometimes overdo alert configuration, mainly as a precaution. For example, we would like to know if a service is using a lot of memory or if its CPU is spiking so that we can react in time in case it develops into a service disruption. In an attempt to have a preemptive, encompassing view of the system, we sometimes configure too many alerts whose thresholds are unjustified. To justify an alert, we need to ask whether the CPU or memory spike is really a concern. It’s not uncommon for services to work a little harder at times. If the historical data shows spikes that resolve themselves and don’t correspond to incidents, the alert should not be configured at all. Besides, if we are concerned about service disruption due to resource shortage, we should consider automatic scaling, not setting more alerts.
Not setting more alerts is one part of prevention being the best medicine. Dealing with the ones that are already set is the other. Alerts that are already set should be examined through historical and operational lenses, for example:

- Has the alert ever corresponded to a real incident, or does it fire and resolve on its own?
- When it fires, does anyone actually act on it, or is it routinely muted?
- Is it a duplicate of, or always accompanied by, another alert for the same underlying cause?

These questions will help us discover alerts that can be removed to prevent alert fatigue.
However, for mature and perhaps more complicated systems with many services and moving parts, going over hundreds of configured alerts and determining whether they are important or redundant can prove quite difficult and time consuming. There’s always the question of ROI when it comes to cleaning up alerts. If we spend a sprint on cleaning up alerts it might benefit us down the line by reducing alert fatigue, but we just missed a sprint where we could have delivered features and bug fixes. So the trade-off is always there. We’d still argue that alert cleanup should take place, even if in small increments. While we are cleaning them up bit by bit, we can introduce an alert manager whose purpose is to provide additional protection against alert fatigue.
An alert manager like Prometheus Alertmanager provides a robust way to manage alerts. In the context of handling alert fatigue, the most significant aspect of an alert manager is its ability to group and inhibit alerts.
Let’s look at an example of correlating and grouping alerts. Imagine a scenario in which a datastore such as a database, search engine, or queue manager is reporting high CPU and memory consumption. Services that depend on this datastore might experience difficulties communicating with it, and they too might trigger alerts indicating that they cannot reach the datastore. With Prometheus Alertmanager, we can configure routing so that, if there’s an alert for the datastore’s resource consumption, all related alerts are grouped together and sent to the same receiver. This way we can see both the underlying cause and which services are affected.
Given that alerts are properly labeled and configured to include the service name, team, cluster, region, and any other attribute of significance, we can configure the routing as follows:
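A minimal sketch of such a routing tree is shown below; the receiver names and the team, cluster, and database labels are illustrative and assume the alerts carry these labels:

```yaml
# Illustrative Alertmanager routing sketch; receiver names and labels are examples.
route:
  receiver: default-receiver
  routes:
    # Route anything labeled for the data-dev team to its own receiver,
    # grouping notifications by cluster and database.
    - matchers:
        - team = "data-dev"
      receiver: data-dev-receiver
      group_by: ['cluster', 'database']
      group_wait: 30s
      group_interval: 5m

receivers:
  - name: default-receiver
  - name: data-dev-receiver
    # Replace with the team's real notification integration (Slack, email, PagerDuty, etc.).
    webhook_configs:
      - url: "http://example.com/data-dev-hook"
```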
In the routes directive, we are saying that any alert meant for the data-dev team should be grouped under the data-dev-receiver by cluster and database. To simulate, we can run these commands:
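Assuming an Alertmanager instance listening on localhost:9093, one way to simulate this is to post two test alerts that share the same cluster and database labels directly to its API (the alert names and label values are illustrative):

```bash
# Fire the underlying datastore alert (illustrative names and label values).
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "DatabaseResourceConsumptionHigh",
                   "team": "data-dev", "cluster": "prod-1",
                   "database": "orders-db", "severity": "critical"}}]'

# Fire a dependent-service alert with the same cluster/database labels
# so it is grouped into the same notification.
curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ServiceDatastoreUnreachable",
                   "team": "data-dev", "cluster": "prod-1",
                   "database": "orders-db", "service": "checkout",
                   "severity": "warning"}}]'
```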
This is what it would look like in Alertmanager:
In other cases we want to suppress related alerts rather than group them with the underlying alert. For example, a database has alerts configured for high resource consumption (DatabaseResourceConsumptionHigh), slow queries (DatabaseSlowQueries), failed queries (DatabaseFailedQueries), and the database being unreachable.
Any of these alerts can be followed by the others. If an alert triggers for resource consumption, then an unreachable alert might also trigger. If queries take too long to execute, an alert for resource consumption might trigger as well.
So any alert might be followed by other alerts, which creates a cascade of incoming alerts when all that’s needed is just the one. Instead of having them all trigger one after another, Prometheus Alertmanager can be configured to suppress a subset of alerts while a specific alert is active. To handle this situation, we can create the following inhibit rules:
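A sketch of such inhibit rules, scoping the inhibition to the same database via the equal clause, could look like this (the alert names match the ones discussed below):

```yaml
# Illustrative inhibit rules; alert names are the ones used in this example.
inhibit_rules:
  # While the resource consumption alert is firing for a database,
  # mute the query-related alerts for that same database.
  - source_matchers:
      - alertname = "DatabaseResourceConsumptionHigh"
    target_matchers:
      - alertname =~ "DatabaseSlowQueries|DatabaseFailedQueries"
    equal: ['database']
```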
We’re saying that if there’s an active DatabaseResourceConsumptionHigh alert for a specific database, any DatabaseSlowQueries or DatabaseFailedQueries alert for that same database will be inhibited. Inhibited means that the alert will not be sent out, but will still be visible on demand. The idea here is that if there’s an alert on resource consumption, we already know something is going on with the database. We don’t need to be paged for all the other issues, as they are probably related, yet we still have them at hand for investigation.
This is what it would look like in Alertmanager:
Inhibited alerts don’t appear when Inhibited is not selected
Inhibited alerts appear when Inhibited is selected
Individual contributors can also use Prometheus Alertmanager’s ability to suppress alerts to help prevent alert fatigue.
An engineer who is making changes to the system and knows that alerts might trigger should use Alertmanager’s ability to silence and inhibit alerts. It goes without saying how disruptive it is to get a high number of alerts in the middle of a workday, especially when these alerts are false positives and could have been silenced in advance.
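For example, silences can be created ahead of time from the command line with amtool. Assuming Alertmanager runs on localhost:9093, something along these lines would silence the database alerts for a two-hour maintenance window (the author, comment, and matcher values are illustrative):

```bash
# Silence all Database* alerts for orders-db for two hours (illustrative values).
amtool silence add \
  --alertmanager.url=http://localhost:9093 \
  --author="jane.doe" \
  --comment="Planned schema migration on orders-db" \
  --duration=2h \
  alertname=~"Database.*" database=orders-db
```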
If the alerts cannot be silenced because they are required as indicators during the operational change, the engineer should take over the on-call (PagerDuty or developer-on-duty) shift for its duration. It might sound trivial, but from personal experience we can testify that this is a common problem.
Using an alert manager and personally assuming responsibility for alerts are the two determining factors in handling alert fatigue. Together they will greatly reduce the number of alerts and pages, mitigating and preventing alert fatigue. However, these are but tools and methods for achieving the goal; they have to be applied continuously and revisited over time.
It’s important to remember that the effort to manage alerts and avoid alert fatigue is a never-ending process. As systems evolve, their monitoring stack evolves as well. Alert handling rules configured in the past might contain oversights, or rest on assumptions that no longer hold, and should be revisited. Think of it as a kind of monitoring for your monitoring: we have alerts and alerting rules in place, but we always have to ask whether they are properly set and whether conditions have changed enough to merit revising them.