Site reliability engineer interview questions

SREs apply software engineering practices to operations. They define and defend service level objectives, run incident response, build the tooling that keeps distributed systems honest, and push reliability work back into product teams through error budgets.

Seniorities

junior - mid - senior - staff - principal

Most reported

Top site reliability engineer interview questions

6 questions

Sorted by reported frequency. Each question comes with a sample answer outline, expected follow-ups, and, where relevant, a reference implementation.

As asked

A new Deployment rollout is stuck at 3 out of 10 pods Ready. Walk me through how you would debug it in production.

Sample answer outline

Start with kubectl rollout status and kubectl describe deployment. Look at the events: ImagePullBackOff, CrashLoopBackOff, and FailedScheduling all point in different directions. Check pod events and pod logs for the failing replicas. Common causes: a readiness probe failing because the app needs longer to start, resource requests too high for the available nodes, or a ConfigMap or Secret reference that does not exist. A PodDisruptionBudget is not one of them: it limits voluntary evictions such as node drains, not the Deployment controller's own rolling update. Resist the urge to delete pods until you know the cause.

Reference implementation (bash)

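# Triage sequence (replace the <...> placeholders with real pod names)
# 1. Is the rollout progressing? --timeout stops the command from hanging on a stuck rollout.
kubectl rollout status deployment/api -n prod --timeout=30s
kubectl describe deployment/api -n prod
# 2. Which replicas are not Ready, and what do their events say?
kubectl get pods -n prod -l app=api
kubectl describe pod <pending-pod> -n prod
# 3. Logs from the previous (crashed) container instance
kubectl logs <crashlooping-pod> -n prod --previous
# 4. The 20 most recent events in the namespace
kubectl get events -n prod --sort-by=.lastTimestamp | tail -20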
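
When several replicas are failing, it can help to see which failure mode dominates before reading individual pods. The sketch below groups the output of kubectl get pods --no-headers by its STATUS column. count_by_status is a hypothetical helper name, and the pod names and states in the sample input are invented for illustration; against a real cluster you would pipe kubectl into the function instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count pods per STATUS value.
# Expects `kubectl get pods --no-headers` output on stdin, where the
# columns are NAME READY STATUS RESTARTS AGE (STATUS is field 3).
count_by_status() {
  awk '{ count[$3]++ } END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# Real usage (needs cluster access):
#   kubectl get pods -n prod -l app=api --no-headers | count_by_status

# Illustrative input only; these pod names and states are made up.
count_by_status <<'EOF'
api-7d9f-abcde   1/1   Running            0     5m
api-7d9f-fghij   1/1   Running            0     5m
api-7d9f-klmno   1/1   Running            0     5m
api-6c8b-pqrst   0/1   CrashLoopBackOff   4     3m
api-6c8b-uvwxy   0/1   CrashLoopBackOff   4     3m
api-6c8b-zabcd   0/1   Pending            0     3m
EOF
```

A result dominated by Pending points towards scheduling (look for FailedScheduling events), while CrashLoopBackOff points towards the container itself, so the logs from the previous instance are the next stop.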

Expect these follow-ups

What if the readiness probe passes but the service is still returning 503s?
How do you set sane defaults for readiness and liveness probes?
When would you use a startup probe?

kubernetes - debugging - rollout

As asked

Tell me about the worst on-call shift you have personally had. What broke, what did you do in the moment, and what changed afterwards?

Sample answer outline

Pick a real story with measurable user impact. Cover detection (was it self-detected or customer-reported), the diagnosis path (especially the dead ends), the mitigation, and the postmortem actions that actually landed. Interviewers are listening for ownership without blame, calm under pressure, and process changes that prevent the next one. Avoid 'we just restarted it' answers; dig into the root cause.

Expect these follow-ups

What did the postmortem find as the contributing factors?
Did the action items actually ship, or did they slip?
What would you do differently in the first 10 minutes if it happened again?

incidents - on-call - postmortem

As asked

You own the checkout service for an e-commerce site. Define a meaningful set of SLOs and explain the error budget policy you would put in place.

Sample answer outline

Pick SLIs that match what the user experiences: availability (proportion of checkout submissions that return a 2xx within 5 seconds), latency (95th percentile under 800ms), and correctness (no duplicate charges). Set the SLO from the SLI: 99.9 percent availability over 28 days gives an error budget of about 40 minutes of full downtime (0.1 percent of 40,320 minutes). Error budget policy: if we burn the budget, freeze feature work on checkout and direct effort to reliability. Spell out the burn-rate alerts (fast: 2 percent of budget in 1 hour, slow: 5 percent in 6 hours) so on-call gets paged on real impact, not noise.
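
The budget arithmetic is worth being able to do on the spot. A minimal sketch, assuming a time-based reading of the SLO (every minute of the window counts equally) alongside a request-based one. error_budget_minutes and allowed_failures are hypothetical helper names, and the request volume in the last call is an invented figure.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for error budget arithmetic; not part of any tool.

# Minutes of full downtime allowed: (1 - SLO) * minutes in the window.
error_budget_minutes() {
  local slo_percent="$1" window_days="$2"
  LC_ALL=C awk -v slo="$slo_percent" -v days="$window_days" \
    'BEGIN { printf "%.2f\n", (1 - slo / 100) * days * 24 * 60 }'
}

# Failed requests allowed for a request-based SLI: (1 - SLO) * total requests.
allowed_failures() {
  local slo_percent="$1" total_requests="$2"
  LC_ALL=C awk -v slo="$slo_percent" -v total="$total_requests" \
    'BEGIN { printf "%.0f\n", (1 - slo / 100) * total }'
}

error_budget_minutes 99.9 28    # 28-day window
error_budget_minutes 99.9 30    # 30-day window
error_budget_minutes 99.99 28   # one more nine, same window
allowed_failures 99.9 1000000   # assumed 1,000,000 checkout submissions
```

Under these assumptions the calls print 40.32, 43.20, 4.03 and 1000. The gap between the 28-day and 30-day figures is why the window should always be stated alongside the target, and the roughly 4-minute budget at 99.99 percent shows how little room that target leaves for a human to respond.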

Expect these follow-ups

What do you do when the business says the SLO is too strict and they want to ship features?
Why is a 99.99 percent SLO usually wrong for a product like this?
How do you measure the correctness SLI in practice?

slo - sli - error-budget

As asked

Walk me through the most recent incident where you were the incident commander. What was the impact, how did you run it, and what fell out of the postmortem?

Sample answer outline

Set the impact in concrete terms (users affected, revenue, duration) up front. Show the roles you assigned: IC, comms, ops, scribe. Walk through the timeline with the key decisions and the rationale for each. The strong answer is honest about the dead ends and the wasted time. Postmortem: blameless review, contributing factors (plural, not a single 'root cause'), and action items with owners and dates. Mention which action items actually landed and which slipped.

Expect these follow-ups

How did you keep stakeholders informed without distracting the responders?
Did the team trust your decisions in the moment? How do you build that trust?
Which action item would you not assign next time?

incident-command - postmortem - leadership

As asked

A team adds user_id and request_id labels to Prometheus metrics and the metrics backend starts falling over. How do you fix it without losing useful debugging signal?

Sample answer outline

Explain that each unique label set creates a time series, so unbounded labels like user_id, request_id, email, or URL path parameters can explode storage and query cost. Remove the high-cardinality labels from metrics and move per-request details to logs or traces, linked by trace_id. Keep bounded labels such as service, route template, status class, region, and dependency. Add instrumentation review, cardinality budgets, and alerts on series growth so this is caught before ingestion fails. Strong candidates preserve the debugging use case rather than just saying 'delete the label'.
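
A quick estimate makes the cardinality argument concrete. The worst-case series count for one metric is the product of the number of distinct values of each label. The sketch below uses a hypothetical helper, series_upper_bound, and invented label counts (20 services, 50 route templates, 5 status classes, 4 regions, 100,000 users). Real series counts are lower because not every combination occurs, and a histogram multiplies the result again by its bucket count.

```shell
#!/usr/bin/env bash
# Hypothetical helper: upper bound on time series for one metric,
# given the number of distinct values for each label.
series_upper_bound() {
  local total=1 n
  for n in "$@"; do
    total=$(( total * n ))
  done
  echo "$total"
}

# Bounded labels only: service, route template, status class, region.
series_upper_bound 20 50 5 4

# The same metric after adding a user_id label with 100,000 distinct users.
series_upper_bound 20 50 5 4 100000
```

With these assumed counts the bound goes from 20000 series to 2000000000, which is the jump a cardinality budget or a series-growth alert is meant to catch.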

Expect these follow-ups

Which labels would you allow on an HTTP server duration histogram?
How do exemplars help connect metrics to traces?
What query patterns become dangerous after cardinality spikes?

metrics - prometheus - cardinality

As asked

Walk me through how you would set up multi-window burn rate alerts for an error budget. Why use two windows instead of one, and what values would you choose for a 30-day SLO?

Sample answer outline

Strong answers explain that a fast-burn alert over a short window (like 1 hour) catches sudden outages, while a slow-burn alert over a longer window (like 6 hours) catches a sustained, lower-rate burn without paging on brief blips. The candidate should know the math: a 14.4x burn rate sustained for 1 hour consumes 2% of a 30-day budget. They should mention the thresholds recommended in Google's Site Reliability Workbook (2% of budget consumed in 1h pages, 5% in 6h pages, 10% in 3 days opens a ticket), each paired with a much shorter window that confirms the burn is still happening. They should also explain that single-window alerts either fire too late or produce too many false positives.
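
The burn-rate math can be checked with a few lines of arithmetic. Burn rate is the fraction of budget consumed, scaled by how much shorter the alert window is than the SLO period; multiplying it by the allowed error ratio gives the error ratio at which the alert should fire. A sketch with hypothetical helper names (burn_rate, alert_error_ratio), assuming a 30-day period and a 99.9 percent SLO:

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the formulas follow the usual burn-rate definition.

# burn_rate BUDGET_PERCENT WINDOW_HOURS PERIOD_DAYS
# Rate at which budget must burn to consume BUDGET_PERCENT within WINDOW_HOURS.
burn_rate() {
  LC_ALL=C awk -v b="$1" -v w="$2" -v d="$3" \
    'BEGIN { printf "%.1f\n", (b / 100) * (d * 24) / w }'
}

# alert_error_ratio BURN_RATE SLO_PERCENT
# Error ratio over the window at which the alert should fire.
alert_error_ratio() {
  LC_ALL=C awk -v r="$1" -v slo="$2" \
    'BEGIN { printf "%.4f\n", r * (1 - slo / 100) }'
}

burn_rate 2 1 30             # fast burn: 2% of budget in 1 hour
burn_rate 5 6 30             # slower burn: 5% of budget in 6 hours
alert_error_ratio 14.4 99.9  # error ratio threshold for the 1-hour alert
```

With these inputs the output is 14.4, 6.0 and 0.0144: under a 99.9 percent SLO, the fast alert fires when more than 1.44 percent of requests have failed over the last hour.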

Expect these follow-ups

How do you decide what SLO target to set in the first place, and who signs off on it?
If your error budget is exhausted three weeks into the 30-day window, what concrete actions do you take?

slo - sli - error-budget - alerting - burn-rate

Also known as

reliability engineer - production engineer - systems engineer


Top site reliability engineer interview questions

Debug a Kubernetes rollout that is stuck - Role-specific - medium - Very common
Tell me about the worst on-call shift you have had - Behavioural - easy - Very common
Define an SLO for a checkout service - Role-specific - medium - Very common
Walk me through how you ran the last incident you commanded - Behavioural - medium - Very common
Handle high-cardinality metrics safely - Role-specific - medium - Very common
Explain SLO error budget burn rate alerts - Role-specific - hard - Very common
