As asked
Write a Prometheus recording rule that pre-computes the 5-minute HTTP error rate per service, and then write an alert rule that fires when this rate exceeds 5% for more than 2 minutes. Show me the YAML.
Sample answer outline
The recording rule uses rate() over the error counter divided by rate() over the total counter, aggregated by service label. The alert rule references the pre-computed metric, sets a threshold of 0.05, uses a 'for: 2m' clause to avoid flapping, and includes meaningful labels and annotations including a runbook URL. A strong answer notes that dividing two rate() calls can produce NaN when the denominator is 0 and handles it with 'or on()' or a guard clause.
Reference implementation (yaml)
groups:
- name: http_error_rate
interval: 1m
rules:
- record: job:http_error_rate:rate5m
expr: |
sum by (service) (
rate(http_requests_total{status=~"5.."}[5m])
)
/
sum by (service) (
rate(http_requests_total[5m])
)
- alert: HighHTTPErrorRate
expr: job:http_error_rate:rate5m > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.service }}"
runbook: "https://wiki.example.com/runbooks/http-errors"Expect these follow-ups
- How do you prevent the alert from firing during a deployment when error counts briefly spike?
- What is the difference between 'for: 2m' and 'keep_firing_for: 2m'?