DevOps Glossary

Prometheus Recording Rule

A Prometheus recording rule is a rule that precomputes a PromQL expression and stores the result as a new time series. In practical terms, it lets you turn expensive or repeated queries into fast, reusable metrics for dashboards, alerts, and service health checks.

What a Prometheus Recording Rule does

A recording rule runs a PromQL query on a schedule and writes the result back into Prometheus with a new metric name. Instead of calculating the same query every time a dashboard loads or an alert evaluates, Prometheus reads the precomputed result.

This is useful when a query is slow, complex, or used in many places. For example, a platform team might precompute a 5-minute HTTP error rate for every service and reuse it in Grafana dashboards and alerting rules.

How it works

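Recording rules are defined in rule files, which are listed under rule_files in the main Prometheus configuration. Prometheus loads these files and evaluates the rules in each group at that group's interval. A group that does not set its own interval uses the global evaluation_interval.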
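
As a rough sketch of how rule files are wired in, the main Prometheus configuration can list them under rule_files and set a default evaluation interval. The file path and the 1m value below are assumptions for illustration, not values taken from this entry.

```yaml
# prometheus.yml (excerpt)
global:
  # Default interval for rule groups that do not set their own `interval`.
  evaluation_interval: 1m

rule_files:
  # Hypothetical path; glob patterns are allowed here.
  - "rules/*.yml"
```

Running promtool check rules against a rule file before reloading Prometheus is a common way to catch syntax errors early.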

A basic recording rule includes:

  • record: the name of the new time series to create.
  • expr: the PromQL expression to evaluate.
  • labels: optional labels to attach or override on the recorded metric.

Example:

groups:
  - name: service-rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

Every 30 seconds, this rule calculates the per-job request rate over the previous 5 minutes and stores the result as job:http_requests:rate5m.
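
The optional labels field can be sketched in the same way. In this variant, the team label and its value are hypothetical and only show where static labels would go.

```yaml
groups:
  - name: service-rates
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
        labels:
          # Hypothetical static label, added to (or overwriting) the
          # labels on every series this rule produces.
          team: platform
```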

Common use cases

  • Speeding up dashboards: Precompute heavy PromQL queries used by Grafana panels.
  • Simplifying alerts: Use clean recorded metrics instead of repeating long expressions in alert rules.
  • Standardizing service metrics: Create consistent metrics such as request rate, error rate, latency percentiles, or saturation signals.
  • Reducing query load: Avoid recalculating expensive aggregations across high-cardinality labels.
  • Building service-level indicators: Store reusable SLI metrics such as availability, success rate, or latency compliance.
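
As one possible illustration of the latency percentile case, the sketch below assumes a histogram named http_request_duration_seconds with a service label. Neither that metric nor the recorded names come from this entry. It records per-service bucket rates first, then estimates the 99th percentile from them.

```yaml
groups:
  - name: service-latency   # hypothetical group name
    rules:
      # Per-service bucket rates; `le` is kept so quantiles can still be computed.
      - record: service_le:http_request_duration_seconds_bucket:rate5m
        expr: sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
      # Estimated 99th percentile latency per service, from the recorded buckets.
      - record: service:http_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, service_le:http_request_duration_seconds_bucket:rate5m)
```

Recording the bucket rates separately keeps other quantiles available at query time, since a stored percentile cannot be re-aggregated afterwards.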

Recording rule vs. alerting rule

A recording rule creates a new time series from a PromQL expression. An alerting rule evaluates a PromQL expression and fires an alert when a condition is met.

  • Recording rule: “Calculate and store this metric.”
  • Alerting rule: “Notify someone if this condition stays true.”

They often work together. For example, you can create a recording rule for service:http_error_rate:5m, then use that recorded metric in an alert when the value exceeds a threshold.
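
A minimal sketch of that pairing might look like the following. The alert name, the 5% threshold, the 10m duration, and the severity label are all assumptions chosen for illustration.

```yaml
groups:
  - name: service-alerts   # hypothetical group name
    rules:
      - alert: HighHttpErrorRate          # hypothetical alert name
        # Reads the recorded metric instead of repeating the full ratio.
        expr: service:http_error_rate:5m > 0.05
        # The condition must hold for 10 minutes before the alert fires.
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High 5xx error ratio on {{ $labels.service }}"
```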

Benefits

  • Faster queries: Dashboards and alerts can read precomputed values instead of running complex PromQL each time.
  • Cleaner PromQL: Teams can hide long expressions behind readable metric names.
  • More consistent monitoring: Multiple teams can use the same recorded metric for the same concept.
  • Lower runtime cost: Prometheus does less repeated query work during dashboard refreshes and alert evaluations.

Tradeoffs and limitations

  • Extra storage: Recorded metrics are stored as additional time series, so they increase storage usage.
  • Cardinality risk: Recording rules that preserve too many labels can create many new series.
  • Delayed availability: Results are updated only once per rule group interval, such as every 30 seconds or 1 minute, so a recorded metric can lag the raw data by up to one interval.
  • Operational maintenance: Rule names, labels, and expressions need version control and review like application code.
  • Bad rules can spread quickly: If a recorded metric is incorrect, dashboards and alerts that depend on it will also be wrong.
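
The cardinality risk can be made concrete by comparing two rules over the same metric. The instance and path labels here are assumptions about how http_requests_total might be labeled, and the recorded names are hypothetical.

```yaml
groups:
  - name: cardinality-comparison   # hypothetical group name
    rules:
      # Risky: one new series per (service, instance, path, status) combination.
      - record: service_instance_path_status:http_requests:rate5m
        expr: sum by (service, instance, path, status) (rate(http_requests_total[5m]))
      # Safer: one new series per service.
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
```

The first rule stores roughly as many series as the raw metric has for those labels, so it saves little query work while adding storage.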

Simple real-world example

Suppose an engineering team runs 80 microservices and tracks request errors with http_requests_total. A raw PromQL query for error rate might be long and repeated across many dashboards:

sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (service) (rate(http_requests_total[5m]))

The team can turn that into a recording rule:

groups:
  - name: service-error-rates  # hypothetical group name
    rules:
      - record: service:http_error_rate:5m
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total[5m]))

Dashboards and alerts can then use service:http_error_rate:5m directly. This makes queries easier to read and faster to evaluate.

Naming practices

Prometheus recording rule names often describe the aggregation and time window. A common pattern is:

level:metric:operations

Examples include:

  • job:http_requests:rate5m
  • service:http_error_rate:5m
  • cluster:cpu_usage:sum_rate5m

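In this pattern, level describes the aggregation level and the labels the output keeps, metric is the underlying metric name, and operations lists the functions applied to it. Good names make it clear what the recorded metric means, what labels it keeps, and whether it represents a rate, ratio, sum, or percentile.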
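
As a sketch of the pattern applied more strictly, an error ratio can be split into a numerator rule, a denominator rule, and a ratio rule whose name states the operation. The names below are one possible reading of the convention, not names defined elsewhere in this entry. The sketch also relies on rules within a group being evaluated in order, so the third rule can read the first two.

```yaml
groups:
  - name: service-http-naming   # hypothetical group name
    rules:
      # Denominator: all requests per service.
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      # Numerator: 5xx responses per service.
      - record: service:http_request_errors:rate5m
        expr: sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
      # Ratio of the two recorded series; the name ends in the operations applied.
      - record: service:http_request_errors_per_requests:ratio_rate5m
        expr: |
          service:http_request_errors:rate5m
          /
          service:http_requests:rate5m
```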

When to use a Prometheus Recording Rule

Use a recording rule when a PromQL expression is expensive, reused often, or important enough to standardize across teams. Avoid creating one for every small query. Each recorded metric adds storage and maintenance overhead, so the best candidates are high-value metrics used in alerts, SLOs, and core operational dashboards.