Question 1

A team adds user_id and request_id labels to Prometheus metrics and the metrics backend starts falling over. How do you fix it without losing useful debugging signal?

Accepted Answer

Explain that each unique label set creates a time series, so unbounded labels like user_id, request_id, email, or URL path parameters can explode storage and query cost. Remove the high-cardinality labels from metrics and move per-request details to logs or traces, linked by trace_id. Keep bounded labels such as service, route template, status class, region, and dependency. Add instrumentation review, cardinality budgets, and alerts on series growth so this is caught before ingestion fails. Strong candidates preserve the debugging use case rather than just saying 'delete the label'.
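
To make the explosion concrete, the sketch below multiplies the number of distinct values per label to get a worst-case series count for one metric. The label names follow the answer above; the value counts are made-up numbers for illustration, not measurements from any real system.

```python
from math import prod


def series_upper_bound(label_cardinalities):
    """Worst-case series count for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities for a request counter with bounded labels only.
bounded = {"service": 20, "route": 30, "status_class": 5, "region": 4}

# The same metric after someone adds an unbounded user_id label.
with_user_id = {**bounded, "user_id": 100_000}

print(series_upper_bound(bounded))        # 12000
print(series_upper_bound(with_user_id))   # 1200000000
```

The bound is pessimistic because not every combination occurs in practice, but it is a quick way to review a proposed label before it ships.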

Question 2

A developer adds a label called 'user_id' to a metric. Explain exactly what happens inside Prometheus, how you would detect the problem, and what options you have to fix it without losing the insight the developer wanted.

Accepted Answer

The answer should cover how each unique label combination creates a new time series in the TSDB head, how this exhausts memory and slows compaction and queries, and how to detect it with the prometheus_tsdb_head_series metric or the TSDB status endpoint (/api/v1/status/tsdb), which lists the metric names and labels with the most series. Fixes include dropping the label via metric_relabel_configs, moving high-cardinality dimensions to log attributes or span attributes instead, or using exemplars to link metrics to traces.
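
The detection step amounts to counting distinct values per label and seeing which label dominates. The sketch below does that over an in-memory list of label sets; the data and helper names are hypothetical, and it only imitates the kind of per-label summary a cardinality report gives rather than calling Prometheus.

```python
from collections import defaultdict


def label_value_counts(series):
    """Count distinct values per label name across a list of label dicts."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def series_after_drop(series, label):
    """Number of distinct label sets left once one label is removed."""
    remaining = {
        frozenset((k, v) for k, v in labels.items() if k != label)
        for labels in series
    }
    return len(remaining)


# Hypothetical series for one metric: 2 routes x 500 users.
series = [
    {"route": route, "status_class": "2xx", "user_id": f"u{i}"}
    for route in ("/checkout", "/search")
    for i in range(500)
]

counts = label_value_counts(series)
worst_label = max(counts, key=counts.get)
print(counts)
print(worst_label, series_after_drop(series, worst_label))
```

Dropping the worst label collapses 1000 series into 2 here. In a real scrape, dropping a label can leave several samples with identical label sets, so removing the label at the instrumentation site is usually the cleaner fix.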

Question 3

An on-call engineer is being paged 200 times during a single outage because every downstream service is alerting on its own. Walk me through how you would use Alertmanager grouping, inhibition, and silence to reduce this to a single meaningful alert.

Accepted Answer

The answer should cover group_by to collapse alerts sharing the same cluster or region label, group_wait to batch the first notification for a new group, group_interval to batch later additions to that group, and inhibition rules to suppress notifications for child-service alerts when a parent (e.g., the database) is already firing a critical alert. The candidate should know that silences are manual and time-bound while inhibitions are automatic and rule-driven. A great answer mentions that inhibit_rules suppress notifications without any visible signal, so an overly broad rule can hide a real problem and make post-incident analysis harder.
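
The interaction between inhibition and grouping is easier to see on toy data. The sketch below is a simplified model, not Alertmanager's implementation: alerts are plain dicts, matchers are exact-match only, and the alert names, labels, and function names are all hypothetical.

```python
from collections import defaultdict


def matches(alert, matchers):
    """True when every matcher label has exactly the given value on the alert."""
    return all(alert.get(name) == value for name, value in matchers.items())


def inhibit(alerts, source, target, equal):
    """Drop target alerts while a source alert with the same `equal` labels fires."""
    sources = [a for a in alerts if matches(a, source)]
    kept = []
    for alert in alerts:
        muted = matches(alert, target) and any(
            src is not alert and all(alert.get(l) == src.get(l) for l in equal)
            for src in sources
        )
        if not muted:
            kept.append(alert)
    return kept


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (one notification each)."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert.get(l, "") for l in group_by)].append(alert)
    return dict(groups)


# One parent alert plus 200 downstream alerts in the same cluster.
alerts = [{"alertname": "DatabaseDown", "severity": "critical", "cluster": "prod-1"}]
alerts += [
    {"alertname": "HighErrorRate", "severity": "warning",
     "cluster": "prod-1", "service": f"svc-{i}"}
    for i in range(200)
]

remaining = inhibit(
    alerts,
    source={"alertname": "DatabaseDown", "severity": "critical"},
    target={"severity": "warning"},
    equal=["cluster"],
)
groups = group_alerts(remaining, ["cluster"])
print(len(alerts), len(remaining), len(groups))
```

Grouping alone would already turn 201 alerts into one notification per cluster; inhibition additionally removes the downstream alerts from that notification so the page names the root cause.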

Question 4

In OpenTelemetry, when would you attach data as a resource attribute versus a span attribute, and what are the performance implications of each choice?

Accepted Answer

Resource attributes describe the entity producing telemetry (service.name, host.name, k8s.pod.name) and are set once per SDK instance, while span attributes describe a specific operation (http.request.method, db.statement). Putting high-cardinality per-request data in resource attributes is wrong because it creates a new Resource identity per request, which some backends treat as separate services. On performance, OTLP sends resource attributes once per batch of spans from the same resource rather than repeating them on every span, so stable resource attributes are cheap on the wire, while every span attribute is paid for on every span. A strong answer mentions that Jaeger/Tempo deduplicate resource attributes across spans in the same trace to save storage.

Question 5

Give me the precise definition of a Prometheus counter, gauge, and histogram. For each one, give a concrete example from a web service and explain what mistakes engineers make when choosing the wrong type.

Accepted Answer

A counter only increases, apart from resetting to zero when the process restarts, and is used for events that accumulate (requests total, errors total). A gauge represents a current state that can go up or down (active connections, queue depth). A histogram counts observations into cumulative buckets and also provides a sum and count, enabling rate and quantile calculations (request duration, response size). Common mistakes: using a gauge for request counts (resets are invisible), using a counter for latency (you lose distribution info), and not pre-configuring histogram buckets to match the actual distribution of your data.
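
Two of these ideas are small enough to sketch: computing a counter's increase across a reset, and estimating a quantile from cumulative histogram buckets by linear interpolation. This is an illustrative approximation of the general technique, not Prometheus's exact implementation, and the sample values and bucket bounds are invented.

```python
import math


def counter_increase(samples):
    """Total increase over consecutive samples; a drop is treated as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets, sorted by bound."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Counter samples with a restart between 150 and 30.
print(counter_increase([100, 150, 30, 80]))

# Request durations in seconds: cumulative counts per upper bound.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(histogram_quantile(0.95, latency_buckets))
```

The quantile estimate can never be more precise than the bucket it lands in, which is why bucket boundaries need to sit near the latencies you care about.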

Question 6

A service is OOMKilled and restarting every 5 minutes. Walk me through the metrics, logs, and events you would check, what the right alert rule looks like in Prometheus, and how you would distinguish OOMKill from crash loops for other reasons.

Accepted Answer

The candidate should mention kube_pod_container_status_restarts_total (from kube-state-metrics) and alerting on the increase over a time window, for example increase(kube_pod_container_status_restarts_total[15m]) > 2 held for a few minutes (the threshold is illustrative). For root cause, comparing container_memory_working_set_bytes with container_spec_memory_limit_bytes in cAdvisor metrics shows a container approaching its memory limit. Logs from the previous container can be fetched with kubectl logs --previous. The reason for the restart is in the Kubernetes events (kubectl describe pod) and in the container's last terminated reason field (OOMKilled vs Error), which kube-state-metrics also exposes as kube_pod_container_status_last_terminated_reason. A great answer ties these together into a runbook.

Question 7

A team just deployed a new critical payment processing service to production. It has no metrics, no traces, no structured logs, and no alerts. It is already taking live traffic. What do you do first, second, and third?

Accepted Answer

First: immediate risk mitigation using existing infrastructure signals (load balancer access logs, Kubernetes pod metrics from kube-state-metrics) to confirm the service is not actively failing. Second: add auto-instrumentation via OTel agent or a sidecar to get basic HTTP metrics and traces without touching the service code. Third: file an incident for the process failure (how did this pass review?), schedule instrumentation work with the team, and add a check to the deployment pipeline. A strong candidate resists rushing in bad alerts, especially ones that misfire on absence of data, and adds alerts only once the underlying signals can be trusted.

Question 8

An engineer opens their Grafana dashboard and all panels show 'No data'. The service is running and taking traffic. Walk me through a systematic diagnosis.

Accepted Answer

Systematic steps: check the datasource connection in Grafana settings, open the query inspector on a panel to see the raw PromQL query and the response, check that the time range covers a period when the metric should exist, verify the metric exists using the Prometheus /api/v1/label/__name__/values endpoint, check that the Prometheus scrape target is UP, and finally check that the label selectors in the query match the actual labels on the metric. A strong candidate mentions checking both the Grafana datasource config and the Prometheus target health page.

Question 9

What telemetry would you add to an LLM application so on-call can debug latency spikes, cost overruns, and quality regressions?

Accepted Answer

Capture request count, error rate, provider status, queue time, time to first token, total latency, input tokens, output tokens, cache hit rate, and estimated cost by route, tenant, model, and prompt version. Quality telemetry should include eval score distributions, user feedback, refusal rate, tool-call failure rate, and structured-output validation failures. Use traces to connect retrieval, model call, tool calls, and post-processing into one request path. Watch cardinality carefully because raw prompt text and user ids do not belong in metric labels. Candidates often provide infrastructure metrics but miss product-level quality signals.

Question 10

Your team has dozens of drift dashboards and nobody trusts the alerts. How would you redesign model drift monitoring?

Accepted Answer

Start by separating input drift, prediction drift, label drift, and business metric degradation because each has a different owner and response. Alert only on monitored signals with an agreed action, such as retraining, feature pipeline investigation, or business review. Use population slices for important cohorts rather than only global averages, since failures often hide in small but valuable segments. Strong candidates discuss delayed labels, seasonality, alert thresholds, and annotation of known events such as campaigns or outages. A weak answer adds more charts instead of reducing noise and tying alerts to playbooks.

Question 11

Your tracing bill is too high, but engineers rely on traces to debug rare checkout failures. How would you change the sampling strategy?

Accepted Answer

Separate head sampling, which decides early and is cheap, from tail sampling, which can keep traces based on outcome after the request completes. Keep all error traces, slow traces, and traces for important business flows, while sampling routine successful traffic at a lower rate. Propagate sampling decisions consistently across services so one request does not produce broken partial traces. Use attributes with bounded cardinality to drive retention rules, and validate with engineers that rare failure modes remain findable. A weak answer simply lowers the global sampling percentage and loses the exact traces people need during incidents.
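
A tail-sampling policy of this shape can be sketched as a single decision function evaluated once a trace is complete. The field names, the 2-second slow threshold, the /checkout route, and the 1% baseline below are assumptions chosen for illustration, not recommended values.

```python
import random
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """What the sampler knows about a finished trace (hypothetical fields)."""
    trace_id: str
    has_error: bool
    duration_ms: float
    route: str


def keep_trace(trace, slow_ms=2000.0, critical_routes=frozenset({"/checkout"}),
               baseline_rate=0.01, rng=random.random):
    """Keep every error, slow, or business-critical trace; sample the rest."""
    if trace.has_error:
        return True
    if trace.duration_ms >= slow_ms:
        return True
    if trace.route in critical_routes:
        return True
    return rng() < baseline_rate


routine = TraceSummary("t1", has_error=False, duration_ms=40.0, route="/home")
failed = TraceSummary("t2", has_error=True, duration_ms=35.0, route="/home")
checkout = TraceSummary("t3", has_error=False, duration_ms=120.0, route="/checkout")

print(keep_trace(failed), keep_trace(checkout), keep_trace(routine))
```

Passing the random source as a parameter keeps the policy testable, which matters when engineers want proof that rare failure modes still survive sampling.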

Question 12

Walk me through exactly how the OpenTelemetry SDK propagates trace context across asynchronous boundaries in a Node.js service. What mechanism does it use, and where does it break down?

Accepted Answer

Strong answers explain the AsyncLocalStorage-based context manager (an implementation of the ContextManager API) that the Node.js OTel SDK uses, how it attaches context to async continuations, and why raw callbacks or fire-and-forget work started outside the active context can silently drop context. A great answer mentions that manual context propagation via context.with() is the escape hatch, and distinguishes this in-process mechanism from the W3C Trace Context header (traceparent), which is the wire format between processes.

Question 13

Explain the Prometheus TSDB storage layout. What are blocks, chunks, and the WAL? How does compaction work and why does it matter for query latency?

Accepted Answer

A strong answer covers the 2-hour block structure, how chunks within a block are memory-mapped, what the write-ahead log (WAL) is for (crash recovery of the in-memory head block), and how the compactor merges smaller blocks into larger ones to reduce the number of files that need to be read during range queries. The candidate should explain that many small, uncompacted blocks slow range queries across long windows, because each block's index has to be opened and consulted separately.

Question 14

Explain the Google SRE burn rate alerting model. Why do we use two windows, what does the fast window catch that a single window misses, and how do you choose the burn rate threshold?

Accepted Answer

The candidate should explain that a single long window (like 1 hour) catches slow burns but delays alerting for fast outages, while a short window (5 minutes) catches fast burns but produces too many false positives on its own. The two-window model fires only when both windows show elevated burn. The burn rate multiple is derived from the SLO budget: a 1x rate means you exhaust the budget exactly at the end of the compliance window. For a 30-day window, alerting at 14.4x over 1 hour means 2% of the budget was consumed in that hour, and at that pace the whole budget is gone in about 50 hours (30 days / 14.4), not in 1 hour. A strong answer mentions the tradeoff between detection latency and false positive rate.
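
The burn rate arithmetic is short enough to check by hand. The sketch below assumes a 30-day compliance window and a 99.9% SLO; the function names are illustrative and the thresholds are the commonly quoted 14.4x example, not a recommendation.

```python
SLO_WINDOW_HOURS = 30 * 24  # assumed 30-day compliance window


def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def budget_consumed(rate, alert_window_hours, slo_window_hours=SLO_WINDOW_HOURS):
    """Fraction of the total error budget burned during the alert window."""
    return rate * alert_window_hours / slo_window_hours


def hours_to_exhaustion(rate, slo_window_hours=SLO_WINDOW_HOURS):
    """Time until the whole budget is gone if this burn rate continues."""
    return slo_window_hours / rate


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Two-window rule: page only if both windows exceed the burn threshold."""
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


print(budget_consumed(14.4, alert_window_hours=1))   # about 0.02, i.e. 2%
print(hours_to_exhaustion(14.4))                     # about 50 hours
print(should_page(0.02, 0.02, slo_target=0.999))     # both windows burning
print(should_page(0.02, 0.0005, slo_target=0.999))   # short window recovered
```

Working backwards, the threshold is just the budget fraction you are willing to lose before paging, times the compliance window, divided by the alert window: 0.02 * 720 / 1 = 14.4.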

Question 15

We are generating 50,000 traces per second and need to reduce that to 500 for storage. Compare head-based and tail-based sampling. Which would you use and why, given that we care about capturing 100% of error traces?

Accepted Answer

Head-based sampling decides at the entry span whether to sample before downstream services run, so errors are frequently dropped because the decision happens before the error is known. Tail-based sampling buffers all spans for a trace and makes the decision after the trace is complete, enabling 100% capture of errors and slow traces. The tradeoff is that tail sampling requires a stateful collector tier (e.g., OTel Collector with the tail_sampling processor) that must route all spans for a trace to the same instance. The candidate should mention consistent hashing on the trace ID or a load balancer tier (for example, the Collector's loadbalancing exporter) to solve the routing problem. Going from 50,000 to 500 traces per second is a 1% keep rate overall, so if every error trace is kept, the rate for routine traffic has to be set below 1% by whatever share errors take.
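
The routing requirement can be illustrated with rendezvous (highest-random-weight) hashing, a simple relative of consistent hashing: every span of a trace picks the same collector, and removing an unrelated collector does not move the trace. This is a sketch of the idea with hypothetical collector names, not how any particular load-balancing component is implemented.

```python
import hashlib


def pick_collector(trace_id, collectors):
    """Choose the collector with the highest hash score for this trace ID."""
    def score(collector):
        return hashlib.sha256(f"{collector}:{trace_id}".encode()).digest()
    return max(collectors, key=score)


collectors = ["collector-0", "collector-1", "collector-2"]

# Spans from different services carry the same trace ID, so they agree on a target.
trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
owner = pick_collector(trace_id, collectors)
print(owner)

# Removing a collector that does not own the trace leaves the assignment unchanged.
survivors = [c for c in collectors if c != owner][:1] + [owner]
print(pick_collector(trace_id, survivors) == owner)
```

A plain modulo over the collector count would also be deterministic, but it would reshuffle most traces whenever the fleet scales, splitting in-flight traces across instances.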

Question 16

We have 40 teams sending telemetry to a central OTel Collector fleet. Some teams need data to go to Datadog, others to Grafana Cloud, and some to both. How do you structure the Collector pipeline, and what risks do you watch for?

Accepted Answer

A strong answer describes using multiple pipelines (receivers, processors, exporters) within a single Collector config or fanning out via the routing connector. The candidate should identify the risk of one team's high-volume data starving queue capacity for others, and propose per-team pipelines with separate queue sizes or a gateway layer that shards by team. They should mention the memory_limiter processor as a safety valve and the risk of data loss when queues fill during exporter backpressure.

Question 17

Explain Loki's label-only indexing model and how it queries log streams. What does this mean for cardinality, query speed on unindexed fields, and total cost compared to a full-text index like Elasticsearch?

Accepted Answer

Loki only indexes the label set (not the log line content), storing chunks per stream in object storage. This makes ingestion cheap and avoids the Elasticsearch index explosion problem, but it means unindexed fields require grep-style chunk scanning which is slow. The right answer covers LogQL's filter expressions running against raw chunks, the importance of keeping label cardinality low (same lessons as Prometheus), and cost comparison: Loki is much cheaper per GB because it avoids full-text indexing write amplification. Bloom filters (added in recent Loki versions) help reduce chunk scanning.

Question 18

Our Prometheus remote_write target goes down for 20 minutes. Describe exactly what Prometheus does: queueing, WAL behavior, memory usage, and what gets lost when the target comes back.

Accepted Answer

Prometheus does not keep a separate on-disk buffer for remote write: each remote_write destination tails the main TSDB WAL and copies samples into in-memory shard queues, sized by queue_config.capacity and drained in batches of max_samples_per_send. While the target is down, sends are retried with backoff, the shard queues fill, and Prometheus stops reading further from the WAL for that destination, so memory growth is bounded by roughly the number of shards times the queue capacity rather than by the length of the outage. Unsent data survives only as long as the WAL segments that hold it: the WAL is truncated roughly every 2 hours, so a 20-minute outage should lose nothing, while an outage longer than about 2 hours irreversibly loses whatever had not been sent. On recovery, Prometheus resumes reading the WAL in order and catches up, which can cause a spike of ingest at the remote end. A strong answer mentions tuning min_shards and max_shards to balance throughput vs memory.

Question 19

I am instrumenting a REST API gateway. What span name, kind, and attributes does the OpenTelemetry HTTP semantic convention require, and what are the common deviations that make traces hard to use?

Accepted Answer

The span name should be 'METHOD /route/template' (e.g., 'GET /users/:id') not the full URL (which has high cardinality). The span kind is SERVER for the inbound side and CLIENT for outbound calls. In the stable HTTP semantic conventions (OTel 1.x), key attributes include http.request.method, http.response.status_code, http.route, server.address, and url.path. The older experimental attributes (http.method, net.peer.name) are deprecated in the stable spec. Common mistakes: using the full URL including query string as the span name (cardinality explosion), missing the route template (all spans named 'GET /'), and not setting error.type on 5xx responses, which makes tail sampling on errors miss them.

Question 20

A Grafana dashboard is timing out for a query that aggregates across 200 services over 30 days. What is a recording rule, when should you use one, and what are the gotchas when you add one after the fact?

Accepted Answer

Recording rules pre-compute expensive aggregations on the Prometheus evaluation interval (e.g., every minute) and store the result as a new metric. This turns a fan-in query across the series of 200 services into a lookup of a few pre-aggregated series, making dashboards fast. Gotchas: adding a recording rule only gives you data from the point the rule was defined, not historical data, so the dashboard will have a gap. A strong answer mentions that history can be backfilled with promtool (tsdb create-blocks-from rules) and that rule files should be validated and tested with promtool before rollout.

Question 21

In an OTel Collector pipeline, I have a batch processor, a memory_limiter, a resource detection processor, and a tail sampling processor. What order should they be in and why?

Accepted Answer

The memory_limiter should be first so it can refuse data before downstream processors consume additional memory. Resource detection should come before any processor that makes routing or sampling decisions based on resource attributes. Tail sampling must come before the batch processor (or batching is handled in the exporter config, not before sampling). A strong answer notes that the tail sampler regroups spans by trace ID and emits its own batches, so batching before it is wasted work, while batching after it is what actually shapes the export payloads.

Question 22

Our application logs 50,000 lines per second of DEBUG and INFO logs at a cost of $8,000 per month. How do you reduce cost without losing the logs that matter during an incident?

Accepted Answer

The candidate should propose log-level filtering at the application or collector tier (drop DEBUG in production), probabilistic sampling of INFO logs (1 in 100), and 100% retention of WARN, ERROR, and FATAL. A great answer mentions that sampling should be trace-aware: if a trace is sampled (for being slow or errored), all its logs should be retained regardless of level. Structured logging with a trace_id field makes this correlation possible. The candidate should also mention log-based metrics as a way to preserve aggregate signals from dropped logs.
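
The retention rules above fit in one decision function. The sketch below keys INFO sampling on the trace ID so that all lines of one request are kept or dropped together; the level names, the 1-in-100 rate, and the function signature are assumptions for illustration.

```python
import random
import zlib

ALWAYS_KEEP = {"WARN", "ERROR", "FATAL"}


def keep_log(level, trace_id, sampled_trace_ids, info_keep_one_in=100,
             rng=random.random):
    """Decide whether one log line is retained."""
    if level in ALWAYS_KEEP:
        return True
    # Trace-aware rule: a trace kept by the sampler keeps all of its logs.
    if trace_id is not None and trace_id in sampled_trace_ids:
        return True
    if level == "DEBUG":
        return False
    if trace_id is None:
        # No request to correlate with, so fall back to random sampling.
        return rng() < 1.0 / info_keep_one_in
    # Deterministic 1-in-N on the trace ID keeps a request's lines together.
    return zlib.crc32(trace_id.encode()) % info_keep_one_in == 0


sampled = {"trace-slow-1"}
print(keep_log("ERROR", None, sampled))            # always kept
print(keep_log("DEBUG", "trace-slow-1", sampled))  # kept: its trace was sampled
print(keep_log("DEBUG", "trace-ok-7", sampled))    # dropped
```

The same function could run in the application or in a collector tier; the important property is that the decision depends only on the level and the trace ID, so every component reaches the same answer.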

Question 23

An engineer complains that their PromQL query shows steps in the graph and misses a spike that lasted 30 seconds. Explain what scrape interval, evaluation interval, and range vector selectors control, and why the spike was missed.

Accepted Answer

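Scrape interval determines how often Prometheus fetches the metric (default 1m). If the spike lasted 30 seconds and the scrape interval is 1 minute, a gauge may never show it because no scrape lands inside the spike; a counter still records the extra increase, but rate() over the range averages it away. The steps in the graph typically appear when the panel draws points more often than new samples arrive, so each value repeats until the next scrape. The evaluation interval controls how often alerting and recording rules are evaluated. The range vector in a query (e.g., [5m] in rate(...[5m])) needs to be at least 2x the scrape interval to reliably get two samples for rate calculation. A strong answer mentions that reducing scrape interval to 15s for latency-sensitive metrics increases storage and CPU cost, and the tradeoff needs to be made consciously.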
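
A small simulation shows why the spike disappears. The gauge below is hypothetical (baseline 10, spiking to 500 for 30 seconds), and the scraper simply samples it at fixed times, which is a simplification of real scrape timing.

```python
def scrape(signal, scrape_interval_s, duration_s, offset_s=0):
    """Sample a signal at fixed intervals, returning (timestamp, value) pairs."""
    return [(t, signal(t)) for t in range(offset_s, duration_s, scrape_interval_s)]


def queue_depth(t):
    """Hypothetical gauge: 10 normally, 500 during a spike from t=70s to t=100s."""
    return 500 if 70 <= t < 100 else 10


coarse = scrape(queue_depth, scrape_interval_s=60, duration_s=300)
fine = scrape(queue_depth, scrape_interval_s=15, duration_s=300)

print(max(value for _, value in coarse))  # 10: scrapes at 60s and 120s straddle the spike
print(max(value for _, value in fine))    # 500: scrapes at 75s and 90s land inside it
```

With a 60-second interval the spike is only caught if a scrape happens to fall inside the 30-second window, so detection depends on scrape phase rather than on the query.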

Question 24

A Grafana dashboard with a multi-value variable 'service' runs a query that takes 30 seconds. Explain how variable interpolation expands the query and what you can do to make it performant.

Accepted Answer

When a multi-value variable expands in PromQL, Grafana typically generates a regex match (for example service=~"api|web|worker", which requires the =~ matcher rather than =) that Prometheus evaluates as a full label scan against all series. With 50 services selected, this can scan tens of thousands of series. The fix is to use recording rules that pre-aggregate by service, or to restructure the variable to use label_values with a metric that has low cardinality. A strong answer also mentions the 'All' value generating a wildcard regex that is even more expensive.

Question 25

Explain what a dead man's switch (or watchdog) alert is, why it is necessary in an observability stack, and show me how you implement one in Prometheus Alertmanager.

Accepted Answer

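A dead man's switch is an alert that always fires. It is routed to an external heartbeat monitor (a service outside the Prometheus and Alertmanager stack that expects a regular check-in), which pages when the notifications STOP, which means the alerting pipeline itself is broken. In Prometheus, you write an alert rule on vector(1) (which always evaluates to 1), so it fires continuously. In Alertmanager, you route it to a dedicated webhook receiver with a short repeat_interval, so the monitor gets a steady heartbeat and pages you if it goes missing for more than a few minutes. Note that send_resolved only controls whether Alertmanager sends resolved notifications; it cannot detect a dead pipeline, because a dead pipeline sends nothing. A strong answer explains that without this, a broken Prometheus or Alertmanager would silently fail to fire any alerts.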
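
The receiving side of a dead man's switch is just a timer that is reset by each notification. The sketch below models that logic with explicit timestamps; the class name and the 5-minute timeout are hypothetical, and a real setup would use a hosted heartbeat service or a small webhook handler running outside the monitored stack.

```python
class HeartbeatMonitor:
    """Pages when the always-firing alert stops arriving."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = None

    def heartbeat(self, now):
        """Called each time the watchdog notification is received."""
        self.last_seen = now

    def should_page(self, now):
        """True once the heartbeat has been missing for longer than the timeout."""
        if self.last_seen is None:
            return False  # not armed until the first heartbeat arrives
        return now - self.last_seen > self.timeout_s


monitor = HeartbeatMonitor(timeout_s=300)
monitor.heartbeat(now=0)
monitor.heartbeat(now=60)
print(monitor.should_page(now=120))  # False: last heartbeat was 60s ago
print(monitor.should_page(now=400))  # True: 340s of silence exceeds 300s
```

The timeout should be a few multiples of the notification repeat interval, so that a single delayed notification does not page anyone.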

Question 26

We use the OpenTelemetry Java agent for auto-instrumentation. Give me three real situations where you had to add manual instrumentation on top of auto-instrumentation, and explain the tradeoffs.

Accepted Answer

Auto-instrumentation covers framework-level spans (HTTP, DB, messaging) but misses business logic. Examples where manual instrumentation is needed: 1) A long-running background job that processes orders needs child spans per order to identify which step is slow. 2) A cache lookup inside a handler is not instrumented by default, so cache miss rate is invisible. 3) Custom batch processing frameworks not in the auto-instrumentation registry. The candidate should note that excessive manual spans increase trace volume and cost, so they should be intentional and targeted.

Question 27

We want to alert if our payment service stops emitting the 'payments_processed_total' metric. Why is this harder than alerting on a threshold, and what are the approaches in Prometheus and Alertmanager?

Accepted Answer

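A threshold alert needs a series to evaluate, and a metric that has stopped being emitted produces no series at all, so the rule simply returns nothing instead of firing. Prometheus provides the absent() function, which returns 1 when no time series match the selector. The challenge is that absent() fires when the job itself disappears from scrape targets, but it also fires during normal Prometheus restarts and scrape failures, causing false positives unless the rule has a for: clause. A better approach is to combine absent() with a check that increase() over a window longer than the expected emission interval is zero (the metric exists but has stopped moving), and to put the job and instance labels in the absent() selector as equality matchers so the alert carries useful labels. A great answer mentions that absent_over_time() in newer Prometheus handles this more cleanly.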

Question 28

We have an API where /search takes 2 to 5 seconds by design, and /health takes 5ms. How do you define a latency SLI that is meaningful for both without the slow endpoints making the SLO unachievable?

Accepted Answer

The correct approach is separate SLIs per endpoint class, not a single global latency SLI. /search has its own threshold (e.g., 90% of requests under 3 seconds) while /health and /list have tighter thresholds. A global SLI that averages across endpoints is meaningless because the slow endpoints swamp the distribution. A strong answer also discusses the ratio SLI model: (requests within threshold) / (total valid requests), and how to weight SLO compliance by revenue impact of each endpoint class.
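
The ratio SLI per endpoint class can be sketched as follows. The class names, the 3-second and 100-millisecond thresholds, and the sample durations are assumptions made up for the example.

```python
def latency_sli(durations_s, threshold_s):
    """Ratio SLI: requests within threshold divided by total valid requests."""
    if not durations_s:
        return None  # no valid requests, so the SLI is undefined
    good = sum(1 for d in durations_s if d <= threshold_s)
    return good / len(durations_s)


def sli_by_class(requests, thresholds):
    """Compute one SLI per endpoint class from (class, duration_s) pairs."""
    by_class = {}
    for endpoint_class, duration_s in requests:
        by_class.setdefault(endpoint_class, []).append(duration_s)
    return {cls: latency_sli(ds, thresholds[cls]) for cls, ds in by_class.items()}


# Hypothetical thresholds: slow-by-design search vs fast endpoints.
thresholds = {"search": 3.0, "fast": 0.1}

requests = [
    ("search", 2.5), ("search", 4.0), ("search", 2.0), ("search", 2.8),
    ("fast", 0.005), ("fast", 0.2),
]

print(sli_by_class(requests, thresholds))
```

Judged against a single global 100 ms threshold, every search request above would count as bad, which is why each class gets its own threshold before the ratios are compared with their SLO targets.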

Question 29

Prometheus OOMKills every night at 2am. The team has been increasing the memory limit for three months and it keeps happening. What is your diagnostic approach and what are the most likely causes?

Accepted Answer

The candidate should describe checking the process_resident_memory_bytes trend for Prometheus over time, querying prometheus_tsdb_head_series and prometheus_tsdb_head_chunks for head block size, looking for cardinality spikes at night (a batch job that creates thousands of label combinations), and checking whether compaction is running at that time (which temporarily increases memory). Because the failure recurs at 2am, scheduled work such as batch jobs or heavy scheduled queries is the first suspect. Fix options include dropping high-cardinality metrics, increasing compaction frequency, using recording rules so dashboards stop querying raw high-cardinality series (the raw series must also be dropped for the head to shrink), or moving to a remote_write-based architecture. A great answer mentions the --query.max-samples flag as a stopgap for query-driven OOM.

Question 30

We upgraded our Java backend library and now distributed traces break at the boundary between Service A and Service B. All spans show up but they are not connected. Walk me through your debugging process.

Accepted Answer

The candidate should check whether the 'traceparent' header (or B3 headers) is being sent by Service A using a debug proxy or logging the outgoing headers, verify Service B is reading the correct header (propagator mismatch after upgrade is common), check whether the new library version changed the propagator or context API, and look at the OTel SDK logs for propagation warnings. A strong answer mentions that mismatched propagators (B3 vs W3C) between services are the most common root cause of trace link breaks after an upgrade.

Questions

Handle high-cardinality metrics safely (Role-specific, medium, Very common)

What happens when a Prometheus label has unbounded cardinality? (Role-specific, medium, Very common)

Explain Alertmanager routing trees, grouping, and inhibition rules (Role-specific, medium, Very common)

What is the difference between span attributes and resource attributes in OTel? (Role-specific, easy, Very common)

When do you use a counter vs a gauge vs a histogram in Prometheus? (Role-specific, easy, Very common)

How do you observe and alert on Kubernetes pod restart loops? (Role-specific, medium, Very common)

A new service goes to production with zero instrumentation. What do you do? (Role-specific, medium, Very common)

The Grafana dashboard shows no data. How do you diagnose it? (Role-specific, easy, Very common)

Instrument latency and cost for LLM calls (Role-specific, medium, Common)

Make drift monitoring actionable (Role-specific, medium, Common)

Choose a distributed tracing sampling strategy (Role-specific, medium, Common)

How does the OpenTelemetry SDK propagate context across async boundaries? (Role-specific, hard, Common)

Describe how Prometheus stores time series data on disk (Role-specific, hard, Common)

How do multi-window burn rate alerts work for SLOs? (Role-specific, hard, Common)

Compare head-based and tail-based sampling for distributed traces (Role-specific, hard, Common)

Design an OTel Collector pipeline for a multi-tenant environment (Role-specific, hard, Common)

How does Grafana Loki index logs and what are the tradeoffs vs Elasticsearch? (Role-specific, medium, Common)

What happens when a remote_write endpoint is slow or unreachable? (Role-specific, hard, Common)

What do the OTel semantic conventions specify for HTTP spans? (Role-specific, medium, Common)

Why would you use a Prometheus recording rule for a dashboard query? (Role-specific, medium, Common)