Question 1

A team adds user_id and request_id labels to Prometheus metrics and the metrics backend starts falling over. How do you fix it without losing useful debugging signal?

Accepted Answer

Explain that each unique label set creates a time series, so unbounded labels like user_id, request_id, email, or URL path parameters can explode storage and query cost. Remove the high-cardinality labels from metrics and move per-request details to logs or traces, linked by trace_id. Keep bounded labels such as service, route template, status class, region, and dependency. Add instrumentation review, cardinality budgets, and alerts on series growth so this is caught before ingestion fails. Strong candidates preserve the debugging use case rather than just saying 'delete the label'.
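To make the cardinality argument concrete, the sketch below estimates a worst-case series count for one metric as the product of the distinct values of each label. The label names and value counts are illustrative assumptions, not measurements from a real system.

```python
from math import prod


def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical bounded labels for an HTTP request counter.
bounded = {"service": 20, "route": 50, "status_class": 5, "region": 4}

# Same metric after someone adds a user_id label (assumed 100,000 active users).
with_user_id = {**bounded, "user_id": 100_000}

print(worst_case_series(bounded))       # 20000
print(worst_case_series(with_user_id))  # 2000000000
```

Real series counts are usually lower than this bound because not every label combination occurs, but the multiplication shows why a single unbounded label dominates everything else.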

Question 2

A developer adds a label called 'user_id' to a metric. Explain exactly what happens inside Prometheus, how you would detect the problem, and what options you have to fix it without losing the insight the developer wanted.

Accepted Answer

The answer should cover how each unique label combination creates a new time series in the TSDB head, how this exhausts memory and slows compaction and queries, and how to detect it with the prometheus_tsdb_head_series metric or the TSDB status endpoint (/api/v1/status/tsdb), which lists the metric names and labels that contribute the most series and values. Fixes include dropping the label via metric_relabel_configs (taking care that the remaining labels still identify each series uniquely), moving high-cardinality dimensions to log attributes or span attributes instead, or using exemplars to link metrics to traces.
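As a rough illustration of the detection step, the sketch below scans a response shaped like the one from the TSDB status API (/api/v1/status/tsdb), which is assumed here to be the cardinality analysis endpoint the answer refers to. The payload is made up, and the field names reflect my understanding of that endpoint, so check them against your Prometheus version. The helper names and the threshold are hypothetical.

```python
import json

# Illustrative payload shaped like a /api/v1/status/tsdb response;
# metric names and numbers are invented for this example.
SAMPLE = json.loads("""
{
  "status": "success",
  "data": {
    "headStats": {"numSeries": 1250000},
    "seriesCountByMetricName": [
      {"name": "http_requests_total", "value": 1100000},
      {"name": "process_cpu_seconds_total", "value": 40}
    ],
    "labelValueCountByLabelName": [
      {"name": "user_id", "value": 98000},
      {"name": "route", "value": 50}
    ]
  }
}
""")


def suspicious_labels(status: dict, max_values: int) -> list[str]:
    """Return label names whose distinct-value count exceeds max_values."""
    entries = status["data"]["labelValueCountByLabelName"]
    return [e["name"] for e in entries if e["value"] > max_values]


def top_metric_share(status: dict) -> tuple[str, float]:
    """Return the metric with the most series and its share of all head series."""
    data = status["data"]
    top = max(data["seriesCountByMetricName"], key=lambda e: e["value"])
    return top["name"], top["value"] / data["headStats"]["numSeries"]


print(suspicious_labels(SAMPLE, max_values=1000))  # ['user_id']
print(top_metric_share(SAMPLE))                    # ('http_requests_total', 0.88)
```

In practice the same check could run on a schedule against the live endpoint and feed a cardinality budget, rather than being run by hand after ingestion is already failing.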

Question 3

An on-call engineer is being paged 200 times during a single outage because every downstream service is alerting on its own. Walk me through how you would use Alertmanager grouping, inhibition, and silence to reduce this to a single meaningful alert.

Accepted Answer

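The answer should cover group_by to collapse alerts sharing the same cluster or region label into one notification, group_wait to delay the first notification for a new group so that related alerts can join it, group_interval to batch alerts that are added to the group later, and inhibition rules to suppress child-service alerts when a parent (e.g., the database) is already firing a critical alert. The candidate should know that silences are created manually and expire at a set time, while inhibitions are automatic and rule-driven. A great answer mentions that inhibited alerts produce no notification, so an overly broad inhibit rule can hide a real problem and make post-incident analysis harder; the alerts still fire in Prometheus and show up as suppressed in Alertmanager.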
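The interaction between grouping and inhibition can be sketched in a few lines. This is a simplified model, not Alertmanager's implementation: it ignores timing (group_wait, group_interval), routing, and regex matchers, and the alert names, labels, and rule are hypothetical.

```python
from collections import defaultdict

Alert = dict[str, str]  # an alert is represented by its label set


def matches(alert: Alert, matchers: dict[str, str]) -> bool:
    """True if every matcher label has the required value on the alert."""
    return all(alert.get(k) == v for k, v in matchers.items())


def is_inhibited(alert: Alert, firing: list[Alert], rule: dict) -> bool:
    """True if the alert matches the rule's target and another firing alert
    matches the source with the same value for every label in rule['equal']."""
    if not matches(alert, rule["target"]):
        return False
    return any(
        src is not alert
        and matches(src, rule["source"])
        and all(src.get(name) == alert.get(name) for name in rule["equal"])
        for src in firing
    )


def notifications(firing: list[Alert], group_by: list[str], rule: dict) -> dict:
    """Group non-inhibited alerts by the group_by labels: one notification per key."""
    groups = defaultdict(list)
    for alert in firing:
        if not is_inhibited(alert, firing, rule):
            key = tuple(alert.get(name, "") for name in group_by)
            groups[key].append(alert["alertname"])
    return dict(groups)


# Hypothetical outage: one database alert plus 200 downstream service alerts.
firing = [{"alertname": "DatabaseDown", "severity": "critical", "cluster": "prod-eu"}]
firing += [
    {"alertname": "HighErrorRate", "severity": "warning",
     "cluster": "prod-eu", "service": f"svc-{i}"}
    for i in range(200)
]

rule = {
    "source": {"alertname": "DatabaseDown", "severity": "critical"},
    "target": {"severity": "warning"},
    "equal": ["cluster"],
}

print(notifications(firing, ["cluster"], rule))  # {('prod-eu',): ['DatabaseDown']}
```

Without the inhibit rule, grouping alone would still yield a single notification for the cluster, but it would carry all 201 alerts; the inhibition is what reduces it to the one alert that names the cause.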

Question 4

In OpenTelemetry, when would you attach data as a resource attribute versus a span attribute, and what are the performance implications of each choice?

Accepted Answer

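Resource attributes describe the entity producing telemetry (service.name, host.name, k8s.pod.name) and are set once when the SDK is initialized, while span attributes describe a specific operation (http.method, db.statement). Putting high-cardinality per-request data in resource attributes is wrong because it creates a new Resource identity per request, which some backends treat as separate services. On performance, a strong answer mentions that OTLP groups spans under their shared resource, so resource attributes are sent once per batch rather than repeated on every span, whereas every span attribute adds to the size of each span that carries it. How much a backend such as Jaeger or Tempo deduplicates resource attributes at storage time depends on the backend and its storage format.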
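The size difference between the two placements can be illustrated with plain dictionaries. This is a loose model of the idea that spans are grouped under a shared resource, not the actual OTLP schema, and the attribute values are hypothetical.

```python
import json

# Resource attributes: identify the producer, fixed for the life of the process.
resource = {
    "service.name": "checkout",
    "host.name": "node-7",
    "k8s.pod.name": "checkout-5d9f",
}

# Span attributes: describe one operation and vary per request.
spans = [
    {"name": "GET /cart",
     "attributes": {"http.method": "GET", "http.route": "/cart"}},
    {"name": "SELECT items",
     "attributes": {"db.statement": "SELECT * FROM items WHERE cart_id = ?"}},
]

# Grouped shape: resource attributes appear once for the whole batch.
grouped = {"resource": resource, "spans": spans}

# Flat shape: resource attributes copied onto every span.
flat = [{**s, "attributes": {**resource, **s["attributes"]}} for s in spans]

grouped_size = len(json.dumps(grouped))
flat_size = len(json.dumps(flat))
print(grouped_size < flat_size)  # True
```

The gap grows with the number of spans per batch, which is why attributes that are constant for the process belong on the resource and only per-operation data belongs on the span.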

Question 5

Give me the precise definition of a Prometheus counter, gauge, and histogram. For each one, give a concrete example from a web service and explain what mistakes engineers make when choosing the wrong type.

Accepted Answer

A counter is a cumulative value that only increases, apart from resetting to zero when the process restarts, and is used for events that accumulate (requests total, errors total). A gauge represents a current state that can go up or down (active connections, queue depth). A histogram counts observations into cumulative buckets and also provides a sum and count, enabling rate, average, and quantile calculations (request duration, response size). Common mistakes: using a gauge for request counts (resets are invisible), using a counter for latency (you lose distribution info), and not pre-configuring histogram buckets to match the actual distribution of your data.
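The histogram definition is the one that trips people up most, so here is a minimal sketch of the data it keeps: per-bucket counts exposed cumulatively, plus a running sum and count. It is a toy model for illustration, not the Prometheus client library; the bucket bounds and durations are hypothetical.

```python
import math
from bisect import bisect_left


class Histogram:
    """Minimal histogram: bucket counts plus a running sum and count."""

    def __init__(self, upper_bounds: list[float]):
        self.bounds = sorted(upper_bounds) + [math.inf]  # last bucket is +Inf
        self.counts = [0] * len(self.bounds)             # stored per bucket
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Pick the first bucket whose upper bound is >= value ("le" semantics).
        self.counts[bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self) -> list[tuple[float, int]]:
        """Exposed form: each bucket counts all observations <= its bound."""
        total, out = 0, []
        for bound, c in zip(self.bounds, self.counts):
            total += c
            out.append((bound, total))
        return out


# Hypothetical request durations in seconds.
h = Histogram([0.1, 0.5, 1.0])
for duration in [0.05, 0.2, 0.3, 0.7, 2.5]:
    h.observe(duration)

print(h.cumulative())  # [(0.1, 1), (0.5, 3), (1.0, 4), (inf, 5)]
print(h.count)         # 5
```

Note how the 2.5 s request is only visible in the +Inf bucket: with bounds that stop at 1.0, any quantile above the 80th percentile cannot be resolved, which is the bucket-mismatch mistake the answer describes.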

Question 6

A service is OOMKilled and restarting every 5 minutes. Walk me through the metrics, logs, and events you would check, what the right alert rule looks like in Prometheus, and how you would distinguish OOMKill from crash loops for other reasons.

Accepted Answer

The candidate should mention kube_pod_container_status_restarts_total (from kube-state-metrics) and alerting on its increase over a time window. For root cause, comparing container_memory_working_set_bytes with container_spec_memory_limit_bytes in cAdvisor metrics shows whether the container is approaching its memory limit. Logs from the previous container can be fetched with kubectl logs --previous. The reason for the restart is in the Kubernetes events (kubectl describe pod) and in the container's last terminated reason field (OOMKilled vs Error), which kube-state-metrics also exposes as kube_pod_container_status_last_terminated_reason. A great answer ties these together into a runbook.
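The step of telling an OOMKill apart from other crash loops can be sketched as a small classifier over the container status. The input is assumed to be shaped like an entry of status.containerStatuses in the output of kubectl get pod -o json; the container names and values are hypothetical, and the function name is my own.

```python
def classify_restart(container_status: dict) -> str:
    """Classify the last termination of one container from its status entry."""
    terminated = container_status.get("lastState", {}).get("terminated")
    if terminated is None:
        return "no previous termination"
    if terminated.get("reason") == "OOMKilled":
        return "OOMKilled"
    reason = terminated.get("reason")
    exit_code = terminated.get("exitCode")
    return f"crash (reason={reason}, exitCode={exit_code})"


# Hypothetical entries shaped like status.containerStatuses[].
oom = {"name": "api", "restartCount": 12,
       "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
app_error = {"name": "worker", "restartCount": 3,
             "lastState": {"terminated": {"reason": "Error", "exitCode": 1}}}
healthy = {"name": "sidecar", "restartCount": 0, "lastState": {}}

for status in (oom, app_error, healthy):
    print(status["name"], "->", classify_restart(status))
# api -> OOMKilled
# worker -> crash (reason=Error, exitCode=1)
# sidecar -> no previous termination
```

A runbook would branch on this result: an OOMKilled container leads to the memory-limit comparison, while any other reason leads to the previous container's logs.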

Top observability engineer interview questions

Handle high-cardinality metrics safely (role-specific, medium, very common)

What happens when a Prometheus label has unbounded cardinality? (role-specific, medium, very common)

Explain Alertmanager routing trees, grouping, and inhibition rules (role-specific, medium, very common)

What is the difference between span attributes and resource attributes in OTel? (role-specific, easy, very common)

When do you use a counter vs a gauge vs a histogram in Prometheus? (role-specific, easy, very common)

How do you observe and alert on Kubernetes pod restart loops? (role-specific, medium, very common)
