DevOps and SRE Interview Preparation

What these interviews are looking for

DevOps and site reliability engineer interviews test a wide surface, but the thread running through all of it is operational maturity. Interviewers want to know whether you can build systems that deploy safely, stay up under load, and recover quickly when they break, and whether you treat reliability as a measurable target rather than a vague aspiration. Knowing the name of a tool is the easy part. Explaining when you would reach for it, what it costs, and what you would lose by using it is what separates strong candidates from people who have only read the documentation.

The rounds usually cover Linux and networking fundamentals, CI/CD and infrastructure as code, a reliability design discussion, and incident response. SRE loops lean harder on reliability theory like service level objectives and error budgets, while DevOps loops lean toward pipelines and automation, but the overlap is large. Prepare across the whole surface rather than betting on one specialism.

The single biggest differentiator across every round is whether you reason from production experience or from theory. Interviewers can tell within two minutes which one they are talking to, because the person who has been paged at 3am talks about blast radius, rollback, and what they would check first, while the person who has only studied talks about features and happy paths.

What good versus weak looks like

It helps to know the shape of a strong answer before you walk in. The same question produces wildly different responses depending on operational maturity.

Dimension	Weak answer	Strong answer
Tool choice	"I would use Kubernetes"	"I would use Kubernetes here because we have many services and need bin packing, but I would not for three services on one box, the operational tax is not worth it"
Reliability	"We aim for high uptime"	"We run to a 99.9% monthly SLO on request success, which leaves an error budget of 0.1% of requests, equivalent to about 43 minutes of full outage in a 30-day month"
Incidents	"I would find the root cause"	"I would stabilise first with a rollback, then investigate, because the priority is restoring service"
Deploys	"We push to production"	"We promote the same artifact through environments and canary it before full rollout"
Monitoring	"We have dashboards"	"We alert on symptoms users feel, latency and error rate, not on every internal signal"

Solidify the fundamentals

A surprising number of candidates can write Terraform but stumble on what happens underneath. Interviewers probe the basics because production debugging depends on them. Be ready to walk through what happens when a request is slow: is it DNS, the load balancer, the application, the database, or the network in between, and how would you isolate each layer? The candidates who do well narrate a path rather than list symptoms.
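One way to make "isolate each layer" concrete is to split a single request into phases. The sketch below assumes an HTTPS endpoint and relies on curl's `-w` timing variables, which are cumulative timestamps; `phase_breakdown` is a hypothetical helper name and `api.example.com` is a placeholder host.

```shell
# phase_breakdown DNS CONNECT TLS TTFB TOTAL
# Arguments are curl's cumulative timings in seconds; subtracting
# neighbours gives the cost of each phase on its own.
phase_breakdown() {
  awk -v dns="$1" -v conn="$2" -v tls="$3" -v ttfb="$4" -v total="$5" 'BEGIN {
    printf "dns=%.3f tcp=%.3f tls=%.3f server=%.3f transfer=%.3f\n",
      dns, conn - dns, tls - conn, ttfb - tls, total - ttfb
  }'
}

# Usage against a placeholder endpoint (word splitting of the curl output is intentional):
# phase_breakdown $(curl -s -o /dev/null \
#   -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}' \
#   https://api.example.com/health)
```

A large `dns` value points at resolution, a large `tcp` or `tls` value at the network or load balancer, and a large `server` value at the application or whatever it depends on, such as the database.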

Know your way around a Linux box well enough to diagnose a problem live: how you would find what is consuming memory or CPU, how you would check open connections, and how you would read logs and follow a process. Networking comes up too: the difference between latency and throughput, how a load balancer distributes traffic, and what a health check actually verifies. These questions reward someone who has actually debugged production rather than only read about it.

Here is the kind of muscle memory worth rehearsing. When a host is misbehaving, you should be able to talk through a triage sequence without hesitating.

Bash

top -o %CPU              # what is burning CPU right now
free -h                  # memory pressure and swap
df -h                    # disk full is a classic silent killer
ss -tunap | head         # open sockets and listening ports
journalctl -u myapp -f   # follow the service log live

The point is not to recite flags. It is to show that when something breaks you have a reflex for where to look first, and that you understand the order matters. You check the cheap, common causes before the exotic ones. A disk that is 100% full or a process stuck in uninterruptible sleep explains far more outages than a kernel bug.
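The two cheap checks named above can be scripted in a few lines. This is a minimal sketch: `full_filesystems` is a hypothetical helper that reads `df -P` output on stdin, and it assumes mount points contain no spaces.

```shell
# full_filesystems THRESHOLD_PCT
# Reads `df -P` output on stdin and prints every mount point whose
# usage is at or above the threshold.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                 # "97%" -> "97"
    if (use + 0 >= limit) print $6, use "%"
  }'
}

# Usage:
# df -P | full_filesystems 90
# ps -eo pid,stat,comm | awk '$2 ~ /^D/'   # processes in uninterruptible sleep
```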

CI/CD and infrastructure as code

Expect to design or critique a deployment pipeline. A common prompt is "walk me through how code gets from a pull request to production safely." Structure the answer around the stages and the safety at each one, and call out where a bad change gets caught.

Build and test on every change, with the pipeline failing fast on a broken test, so a regression never reaches a human reviewer's attention as a surprise.
Promote the same artifact through environments rather than rebuilding, so what you tested is what ships. Rebuilding per environment quietly reintroduces drift.
Roll out gradually with a canary or blue-green deploy, watching metrics before sending all traffic, so a bad release degrades 1% of users rather than 100%.
Keep a fast, automated rollback, because the question is not whether a bad deploy happens but how quickly you recover. A rollback that needs a manual approval and a 15-minute pipeline is not a rollback.
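The "watching metrics before sending all traffic" step is often implemented as an automated gate. The sketch below shows the idea only: `canary_gate` is a hypothetical function, and the error counts and threshold would come from your own metrics system.

```shell
# canary_gate ERRORS REQUESTS MAX_ERROR_PCT
# Prints "promote" if the canary's error rate is within the limit,
# otherwise "rollback".
canary_gate() {
  awk -v errors="$1" -v requests="$2" -v max="$3" 'BEGIN {
    if (requests == 0) {              # no traffic means no evidence: fail closed
      print "rollback"
      exit
    }
    if (errors / requests * 100 <= max) print "promote"
    else print "rollback"
  }'
}

# Usage: 5 errors in 1000 canary requests against a 1% limit
# canary_gate 5 1000 1
```

Failing closed on zero traffic is a deliberate choice here: a canary that received no requests has not proven anything.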

For infrastructure as code, be ready to talk about why you would manage infrastructure declaratively, how you handle state safely, and why you would never click changes into a console for anything that matters. A short, clear snippet shows fluency.

HCL

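resource "aws_autoscaling_group" "api" {
  min_size            = 2
  max_size            = 10
  desired_capacity    = 3
  health_check_type   = "ELB" # use load balancer health checks, not just EC2 status
  vpc_zone_identifier = var.private_subnet_ids # hypothetical variable, declared elsewhere

  launch_template {
    id      = aws_launch_template.api.id # hypothetical resource, defined elsewhere
    version = "$Latest"
  }
}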

The point to make is that this is version controlled, reviewed, and repeatable, so the environment can be rebuilt from scratch and there is no undocumented snowflake server. Push the conversation further if you can: explain that state is the dangerous part, that you keep it in a remote backend with locking so two engineers cannot corrupt it with a simultaneous apply, and that you separate state per environment so a mistake in staging cannot touch production.

Sample dialogue: critiquing a pipeline

Interviewers often hand you a flawed setup and ask what you would change. A good answer prioritises and explains the risk rather than just listing improvements.

Interviewer: "Our pipeline builds a fresh Docker image in each environment, deploys straight to all production hosts at once, and rollback means re-running the old commit through the full build. What would you fix first?"

Candidate: "Three things, in priority order. First, the all-at-once production deploy is the highest risk, a bad change hits every user instantly, so I would add a canary or rolling deploy with a metric gate. Second, rebuilding per environment means you are not shipping what you tested, so I would build one artifact and promote it. Third, rollback through a full rebuild is too slow when you are mid-incident, so I would keep the previous artifact ready to redeploy in seconds. The deploy strategy is the one I would do first because it bounds the blast radius of every other mistake."

That answer works because it ranks the fixes by blast radius, names the failure each one prevents, and finishes with a one-line justification of the ordering. That is the structure to aim for on any "what would you improve" question.

Reliability and SRE concepts

SRE rounds dig into how you measure and protect reliability. Be precise about the vocabulary, because vague answers here are a clear tell. A service level indicator is the thing you measure, such as the share of requests served under 300 milliseconds. A service level objective is the target for that indicator, such as 99.9% over a month. The error budget is what is left over: 100% minus the SLO, so 0.1% in this example, the small fraction of failures you are allowed.

The reason error budgets matter is that they turn reliability into a shared, numeric decision. If you have burned the budget, you slow down and focus on stability. If you have budget to spare, you can ship faster. Framing reliability this way, as a tradeoff the whole team can see, is exactly the maturity SRE interviewers look for. It also reframes an argument that is usually political, product wants velocity and ops wants stability, into a number both sides agreed on in advance.
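To show how that number is produced, here is a rough sketch for a request-based SLO. `budget_remaining_pct` is a hypothetical helper, and the request counts are made-up inputs rather than anything from a real system.

```shell
# budget_remaining_pct SLO_PCT TOTAL_REQUESTS FAILED_REQUESTS
# Prints the share of the error budget still unspent, as a percentage.
# A negative result means the budget is overspent.
budget_remaining_pct() {
  awk -v slo="$1" -v total="$2" -v failed="$3" 'BEGIN {
    allowed = total * (100 - slo) / 100   # failures the SLO permits
    if (allowed == 0) { print "0.0"; exit }
    printf "%.1f\n", (allowed - failed) / allowed * 100
  }'
}

# Usage: 99.9% SLO, 1,000,000 requests so far this month, 250 failed
# budget_remaining_pct 99.9 1000000 250
```

With those inputs the SLO permits 1,000 failures, so 250 failures leaves three quarters of the budget. A team might agree in advance on the level at which releases slow down.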

It helps to have the budget arithmetic in your head, because interviewers do ask you to convert an SLO into real downtime.

Monthly SLO	Error budget	Allowed downtime per 30 days
99%	1%	about 7 hours
99.9%	0.1%	about 43 minutes
99.95%	0.05%	about 22 minutes
99.99%	0.01%	about 4 minutes
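The figures in the table come from one line of arithmetic: the budget fraction multiplied by the minutes in the window. A small sketch, with `error_budget_minutes` as a hypothetical helper name and a 30-day window assumed by default:

```shell
# error_budget_minutes SLO_PCT [DAYS]
# Converts an availability SLO into minutes of allowed full downtime.
error_budget_minutes() {
  awk -v slo="$1" -v days="${2:-30}" 'BEGIN {
    printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60
  }'
}

# Usage:
# error_budget_minutes 99.9      # 30-day window
# error_budget_minutes 99.9 7    # 7-day window
```

For 99.9% over 30 days this is 0.001 × 43,200 minutes, which is 43.2 minutes.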

The lesson to voice out loud is that each extra nine costs more than the last, often an order of magnitude more in engineering effort, so you choose the target that matches what users actually need rather than reaching for 99.99% by reflex. A batch reporting system and a payments API do not deserve the same SLO.

Be ready to design for resilience: retries with backoff and jitter, circuit breakers so a failing dependency does not cascade, timeouts on every external call, and graceful degradation so a non-critical feature failing does not take down the core path. Name the failure modes before you are pushed. A subtle point worth raising is that naive retries make outages worse, because a struggling service gets hammered by a synchronised retry storm. That is why backoff and jitter matter: they spread the load instead of concentrating it.
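A minimal sketch of retries with capped exponential backoff and full jitter follows. `retry_with_backoff` is a hypothetical helper, the delays are whole seconds for simplicity, and the randomness is read from `/dev/urandom`, which is assumed to exist on the host.

```shell
# retry_with_backoff MAX_ATTEMPTS BASE_SECONDS CAP_SECONDS COMMAND [ARGS...]
# The body runs in a subshell ( ... ) so its variables stay local.
retry_with_backoff() (
  max_attempts=$1 ceiling=$2 cap=$3
  shift 3
  attempt=1
  while ! "$@"; do
    [ "$attempt" -ge "$max_attempts" ] && exit 1   # give up and report failure
    [ "$ceiling" -gt "$cap" ] && ceiling=$cap      # never wait longer than the cap
    # Full jitter: sleep a random whole number of seconds in [0, ceiling],
    # so clients that failed together do not retry together.
    random=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
    sleep $(( random % (ceiling + 1) ))
    ceiling=$(( ceiling * 2 ))                     # exponential growth
    attempt=$(( attempt + 1 ))
  done
)

# Usage against a placeholder endpoint: up to 5 attempts, 1s base, 30s cap
# retry_with_backoff 5 1 30 curl -fsS https://api.example.com/health
```

The cap and the attempt limit matter as much as the jitter: without them a client keeps retrying indefinitely against a dependency that is already down.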

Incident response

Almost every loop includes an incident question, often "tell me about the worst outage you handled" or "a service is down, walk me through your response." The interviewer wants calm structure under pressure, not heroics.

Lead with stabilising the system, not finding the root cause. The first job in an incident is to restore service, even with a temporary fix like a rollback or shifting traffic, and only then to investigate why. Describe how you would establish a clear incident commander, keep communication flowing to stakeholders, and avoid the trap of several people making uncoordinated changes at once. A useful framing is to separate the roles: someone runs the incident, someone communicates, someone does the hands-on debugging, and those are different jobs even if a small team means one person wears two hats.

A simple structure you can speak through under pressure:

Acknowledge and assess. Confirm the impact, who is affected, and how badly.
Stabilise. Restore service with the fastest safe lever, usually rollback or traffic shift, before you understand the cause.
Coordinate. Name an incident commander, open a single channel, stop uncoordinated changes.
Communicate. Tell stakeholders what is known, what you are doing, and when the next update lands.
Investigate and resolve. Once the bleeding has stopped, find and fix the real cause.
Learn. Run a blameless postmortem with concrete follow-ups.

After the incident, talk about the blameless postmortem. The goal is to find the systemic causes and the missing guardrails, not a person to blame. Mentioning concrete follow-ups, such as an alert that should have fired earlier, a missing automated rollback, or a runbook that was out of date, shows you treat incidents as a source of improvement rather than something to move past quickly. The phrase that signals maturity here is that a human operating a confusing system is not a root cause; the confusing system is.

A worked incident answer

When asked to tell a real story, use a tight structure: situation, what you saw, what you did first, the resolution, and the lasting fix. Here is the shape of a strong answer.

"Checkout latency spiked and error rate climbed to about 8%. I acknowledged the page, confirmed it was hitting real users on the payment path, and declared an incident rather than poking at it alone. The fastest safe lever was to roll back the deploy from twenty minutes earlier, so I did that first and latency recovered within three minutes. Only then did I dig in. The root cause was a new database query that missed an index and locked under load. The lasting fixes were the index, a query timeout so a slow query degrades instead of cascading, and a pre-deploy check that flags unindexed queries in review. The postmortem focused on why the missing index reached production, not on the engineer who wrote it."

Notice what makes it land: it stabilised before investigating, it gave a number for impact, it named a concrete root cause, and the follow-ups were systemic guardrails rather than "be more careful." That last part is what interviewers are listening for.

Monitoring and observability

Reliability work depends on being able to see what the system is doing. Be ready to discuss the difference between metrics, logs, and traces, and when each helps. Metrics are for trends and alerting, logs for the detail of a specific event, and traces for following a request across services. A clean way to put it is that metrics tell you something is wrong, traces tell you where, and logs tell you why.

Talk about alerting on symptoms that users feel, like error rate and latency, rather than on every internal signal, because alert fatigue is a real failure mode. An on-call engineer who is woken by noise will eventually miss the alert that matters. Designing alerts that are actionable and tied to user impact is a strong, practical signal. If you can reference a structured approach, such as the four golden signals of latency, traffic, errors, and saturation, the USE method (utilisation, saturation, errors, applied to resources), or the RED method (rate, errors, duration, applied to services), do so, but explain the idea rather than just dropping the acronym.
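As an illustration of a symptom-level signal, the sketch below computes the share of requests that ended in a server error. It assumes access logs in the common or combined log format, where the status code is the ninth whitespace-separated field; `error_rate_pct` and the log path are hypothetical.

```shell
# error_rate_pct
# Reads access-log lines on stdin and prints the percentage of
# requests with a 5xx status (status code assumed to be field 9).
error_rate_pct() {
  awk '{ total++; if ($9 ~ /^5/) errors++ }
       END {
         if (total == 0) print "0.00"
         else printf "%.2f\n", errors / total * 100
       }'
}

# Usage against a placeholder log path:
# tail -n 10000 /var/log/nginx/access.log | error_rate_pct
```

In practice the number would come from a metrics system rather than a log scrape, but the signal is the same one a user feels: the fraction of their requests that failed.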

A practical test you can offer for any alert: is it worth waking someone up for, and if it fired right now, is there a clear action to take? If the answer to either is no, it should be a dashboard or a ticket, not a page. That distinction between paging alerts and informational signals is something a lot of teams get wrong, and naming it shows you have lived with on-call.

How seniority and role change the bar

The same topics come up at every level, but what counts as a strong answer shifts. Calibrate your answers to the role you are interviewing for.

Level	What they are really testing
Junior to mid	Can you operate safely: read logs, follow a runbook, do a rollback, ask for help at the right moment
Senior	Can you design for reliability and make tradeoffs: choose an SLO, design a deploy strategy, run an incident
Staff and above	Can you set direction: shape the platform, reduce a class of incidents permanently, influence how the org thinks about reliability

Role matters too. A DevOps or platform engineering loop will weight pipelines, IaC, developer experience, and self-service tooling. An SRE loop will weight SLOs, error budgets, capacity, and incident command. A pure platform role may push on multi-tenancy and golden paths. Read the job description and the team before you decide where to go deep, and if you are unsure, ask the recruiter which competencies the loop emphasises.

Common mistakes to avoid

Naming tools without explaining when, why, or at what cost you would use them.
Skipping straight to root cause in an incident answer instead of stabilising first.
Being vague about SLOs and error budgets, which signals you have not run a service to a target.
Forgetting rollback. A deploy strategy with no fast way back is incomplete.
Reaching for maximum reliability or the heaviest tooling by reflex, instead of matching effort to need.
Blaming a person in a postmortem story, which tells the interviewer you have not internalised blameless culture.
Listing improvements without prioritising them. Ranking by blast radius is the signal.

FAQ

Do I need to memorise Kubernetes internals? No, but understand the concepts deeply: pods, services, deployments, and how scheduling and health checks work. Be honest about hands-on depth. Reciting internals you have never used is easy to expose with one follow-up. Knowing when not to use Kubernetes is often a stronger signal than knowing its internals.

How much coding is in these loops? More than people expect. You should be comfortable scripting in Python or Bash, automating a task, and reasoning about a small program. Some SRE loops include a full coding round close to a software engineering interview.

What if I have not run a service with a formal SLO? Be honest, then reason from first principles. Define what you would measure, set a target, and explain how you would use the error budget. Demonstrating the thinking matters more than having the war story.

How do I handle a tool I have not used? Say so plainly, then map it to one you do know and reason about the tradeoffs. "I have not run Argo CD, but I understand GitOps, and the principle is that the desired state lives in git and a controller reconciles to it." Honesty plus transferable reasoning beats bluffing every time.

How to practise

Rehearse out loud. Walk through a deployment pipeline end to end, design an observability stack from scratch, define an SLO and error budget for a service, and talk through a real incident using the stabilise-then-investigate structure. After each, check that you justified your tool choices, named the failure modes before being pushed, prioritised your fixes by blast radius, and tied alerts to user impact. Record yourself once and listen back; the gaps are obvious when you hear them. That operational mindset, reliability as something you measure and defend, is what DevOps and SRE interviews are built to find.

Continue your prep

Apply this against real role questions and templates:

DevOps engineer interview questions

Site reliability engineer interview questions

Platform engineer interview questions
