Question 1

Describe a production incident where a cloud infrastructure failure caused customer impact. What was your role, what did you do, and what did you change afterward?

Accepted Answer

Strong answers follow the STAR format (Situation, Task, Action, Result): the situation with a specific blast radius (how many customers, for how long), the candidate's role, the actions taken during the incident (detection, mitigation, communication), the root cause found in the postmortem, and the concrete process or architecture changes that reduced the probability of recurrence. Interviewers look for ownership, technical accuracy, and learning orientation rather than blame.

Question 2

Tell me about a time you identified and drove a meaningful reduction in cloud infrastructure costs. How did you find the opportunity and what was the outcome?

Accepted Answer

Candidate should quantify the saving (percentage or dollar amount), describe the discovery method (Cost Explorer, Trusted Advisor, rightsizing analysis), explain how they got buy-in from engineering or finance, and describe what guardrails were put in place so the cost did not creep back. Interviewers look for data-driven thinking and cross-functional communication, not just technical execution.

Question 3

Tell me about a time when a security requirement threatened to delay a launch and you had to navigate the tradeoff between the two.

Accepted Answer

A strong answer shows the candidate did not simply capitulate to either pressure: they quantified the security risk (likelihood and impact), proposed a time-bounded compensating control that allowed the launch, and committed to closing the gap within a defined window. Interviewers look for risk articulation, stakeholder communication, and follow-through on the remediation.

Question 4

Describe a significant infrastructure migration you led or were a key contributor to. What was the scope, what went wrong, and how did you keep it on track?

Accepted Answer

Candidate should describe the scale (number of services, teams involved, timeline), the planning approach (phased migration, dry runs, rollback plan), what unexpected issue arose mid-migration, and how they adapted. Interviewers want evidence of project management skill alongside technical depth, and the ability to handle ambiguity without stalling.

Question 5

Have you ever tried to introduce or enforce infrastructure-as-code standards across teams that were not your own? How did you get adoption?

Accepted Answer

Candidate should describe the specific standard (module structure, naming convention, tagging enforcement), how they built the case (showing concrete problems the standard solves), the approach to adoption (working examples, office hours, golden-path templates), and how they measured uptake. Forcing standards via governance alone without making the good path easy rarely works.

Question 6

Tell me about a time when you joined or were on a cloud operations on-call rotation that was overwhelmed with noisy or manual alerts. What did you change?

Accepted Answer

Candidate should identify the root cause of on-call pain (alert volume, alert quality, missing runbooks, manual remediation steps), describe specific changes made (threshold tuning, composite alarms, runbook automation with SSM, better dashboards), and show before-and-after metrics (alerts per week, MTTR). Strong answers also mention psychological safety in postmortems.

Question 7

Describe a time when you believed a proposed cloud architecture had a serious flaw and you pushed back on it. How did you make your case and what was the outcome?

Accepted Answer

Candidate should describe the proposed design and the specific flaw (single point of failure, cost trap, security hole, etc.), how they gathered evidence (data, Well-Architected principles, prior incidents), and how they presented the alternative constructively rather than just criticising. The outcome may have been that they were overruled; if so, they should be able to describe how they handled that too.

Question 8

Describe a cloud tool or internal platform you built that measurably improved other engineers' ability to provision or operate cloud infrastructure. How did you measure the impact?

Accepted Answer

Candidate should describe the pain point solved (slow manual provisioning, inconsistent environments, no self-service), the solution (Terraform module library, account vending machine, internal dev portal), and how adoption and impact were measured (time to provision, number of teams using it, DORA metrics). Strong answers treat internal tools with the same product thinking as customer-facing ones.

Question 9

Tell me about a time you identified significant cloud infrastructure technical debt, such as legacy manual provisioning, unmanaged resources, or ageing AMIs, that was causing operational pain. How did you prioritise addressing it alongside feature work?

Accepted Answer

Candidate should frame the debt in business terms (cost, reliability risk, developer velocity), show how they made it visible to stakeholders (incident frequency correlation, cost to maintain), and describe the negotiation for time to address it. Good answers show incrementalism: addressing the highest-risk debt first rather than requesting a full freeze on feature work.

Question 10

Describe a time you evaluated two or more cloud services or vendors for the same use case. How did you structure the evaluation and make the final recommendation?

Accepted Answer

Candidate should describe defining evaluation criteria upfront (cost, operational complexity, lock-in, performance, team familiarity), building a proof of concept for each option, involving the relevant stakeholders in the review, and writing a clear recommendation with the tradeoffs documented. Strong answers mention what they would revisit if the initial choice proves wrong.

Question 11

Describe a time you had to prepare AWS or cloud infrastructure for an external compliance audit (SOC 2, ISO 27001, PCI-DSS, or similar). What did you do and what surprised you?

Accepted Answer

Candidate should describe the compliance framework, the evidence collection process (CloudTrail logs, Config compliance reports, IAM access reviews), how they worked with security and legal, and what gaps they found and closed. A strong answer mentions the operationalisation after the audit to keep evidence collection continuous rather than point-in-time.

Question 12

You arrive Monday morning to an alert that your AWS bill for Sunday was double the typical daily spend. You have no immediate idea what caused it. Walk me through your first 30 minutes.

Accepted Answer

Candidate should describe checking Cost Explorer by service, then by linked account and region to isolate the spike, then drilling into CloudTrail for resource-creation events (for example RunInstances or CreateNatGateway) in that time window. Common culprits: a runaway EC2 Spot Fleet, a data transfer spike, new NAT Gateway traffic, or a DDoS generating load balancer requests. Good answers also describe who they notify immediately and whether they set up an AWS Budgets alert to catch future spikes.
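The "by service" isolation step can be sketched in a few lines. This is a minimal illustration, assuming the input has the shape of a Cost Explorer GetCostAndUsage response grouped by SERVICE with the UnblendedCost metric; the helper names, service names, and dollar amounts below are hypothetical sample data, not real billing output.

```python
from collections import defaultdict


def cost_by_service(day_result):
    """Map service name -> cost for one entry of ResultsByTime."""
    costs = defaultdict(float)
    for group in day_result.get("Groups", []):
        service = group["Keys"][0]
        costs[service] += float(group["Metrics"]["UnblendedCost"]["Amount"])
    return dict(costs)


def rank_spikes(baseline_day, spike_day):
    """Return (service, increase) pairs, largest day-over-day increase first."""
    before = cost_by_service(baseline_day)
    after = cost_by_service(spike_day)
    services = set(before) | set(after)
    deltas = {s: after.get(s, 0.0) - before.get(s, 0.0) for s in services}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


def _group(service, amount):
    """Build one hypothetical Groups entry in the Cost Explorer shape."""
    return {
        "Keys": [service],
        "Metrics": {"UnblendedCost": {"Amount": amount, "Unit": "USD"}},
    }


# Hypothetical sample data: a typical Saturday and the Sunday that doubled.
saturday = {"Groups": [
    _group("Amazon Elastic Compute Cloud - Compute", "400.00"),
    _group("EC2 - Other", "50.00"),
    _group("Amazon Simple Storage Service", "30.00"),
]}
sunday = {"Groups": [
    _group("Amazon Elastic Compute Cloud - Compute", "400.00"),
    _group("EC2 - Other", "520.00"),
    _group("Amazon Simple Storage Service", "32.00"),
]}

top_service, top_increase = rank_spikes(saturday, sunday)[0]
```

Ranking by absolute increase rather than percentage keeps a tiny service that tripled from hiding the one that actually moved the bill. The same function can be rerun on responses grouped by linked account or region to narrow things down further.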

Question 13

A colleague ran terraform state rm on the wrong resource and now Terraform shows a plan to create 30 resources that already exist. Production is running fine. What do you do?

Accepted Answer

Candidate should describe immediately locking the workspace to prevent further applies, restoring the previous state file version from the versioned S3 backend bucket (this is exactly why you enable versioning on that bucket), verifying the restored state with terraform plan to confirm it reports no changes, and running a blameless postmortem. If the backup predates other legitimate changes, they should instead selectively terraform import the missing resources rather than restoring wholesale.
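Choosing which state version to restore is the step most easily fumbled under pressure. The sketch below assumes a list of entries shaped like the Versions field of an S3 ListObjectVersions response (VersionId and a timezone-aware LastModified); the function name, version IDs, and timestamps are hypothetical.

```python
from datetime import datetime, timezone


def last_good_version(versions, incident_time):
    """Return the VersionId of the newest version written before the incident.

    Returns None when no version predates the incident.
    """
    candidates = [v for v in versions if v["LastModified"] < incident_time]
    if not candidates:
        return None
    return max(candidates, key=lambda v: v["LastModified"])["VersionId"]


# Hypothetical versions of the state object, newest first as S3 lists them.
state_versions = [
    {"VersionId": "v3", "LastModified": datetime(2024, 5, 6, 10, 15, tzinfo=timezone.utc)},
    {"VersionId": "v2", "LastModified": datetime(2024, 5, 6, 9, 40, tzinfo=timezone.utc)},
    {"VersionId": "v1", "LastModified": datetime(2024, 5, 3, 16, 5, tzinfo=timezone.utc)},
]

# Assumed time at which the mistaken `terraform state rm` was run.
incident = datetime(2024, 5, 6, 10, 0, tzinfo=timezone.utc)

restore_candidate = last_good_version(state_versions, incident)
```

If the chosen version turns out to be much older than the incident, that is the signal that restoring wholesale would discard legitimate changes and that selective imports are the safer route.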

Question 14

During a routine audit you find that a production EC2 security group has port 22 open to 0.0.0.0/0. The engineer who owns it is on holiday. What do you do?

Accepted Answer

Candidate should describe assessing the risk (how long has this been open, any sign of access in CloudTrail or VPC Flow Logs, is the instance publicly accessible), then closing the rule immediately without waiting for the owner since the risk is clear. They should notify the owner's manager and open a ticket. Strong answers mention checking whether there are other security groups with similar rules (Config rule, Security Hub control) to find the scope of the problem.
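The "find the scope of the problem" step can be illustrated with a small scan. This is a sketch that assumes input shaped like the SecurityGroups list from an EC2 DescribeSecurityGroups response; the group IDs and rules are made-up sample data, and in practice a Config rule or Security Hub control would do this continuously.

```python
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def rule_covers_port(permission, port):
    """True if an ingress rule applies to the given TCP port."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and all ports
        return True
    if protocol != "tcp":
        return False
    from_port = permission.get("FromPort")
    to_port = permission.get("ToPort")
    if from_port is None or to_port is None:
        return False
    return from_port <= port <= to_port


def open_to_world(permission):
    """True if the rule allows any IPv4 or IPv6 source address."""
    cidrs = [r.get("CidrIp") for r in permission.get("IpRanges", [])]
    cidrs += [r.get("CidrIpv6") for r in permission.get("Ipv6Ranges", [])]
    return any(cidr in OPEN_CIDRS for cidr in cidrs)


def groups_with_open_port(security_groups, port=22):
    """Return IDs of groups with the port reachable from anywhere."""
    return [
        sg["GroupId"]
        for sg in security_groups
        if any(
            rule_covers_port(p, port) and open_to_world(p)
            for p in sg.get("IpPermissions", [])
        )
    ]


# Hypothetical sample data.
sample_groups = [
    {"GroupId": "sg-aaa", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-bbb", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "10.0.0.0/8"}]}]},
    {"GroupId": "sg-ccc", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-ddd", "IpPermissions": [
        {"IpProtocol": "-1", "IpRanges": [],
         "Ipv6Ranges": [{"CidrIpv6": "::/0"}]}]},
]

exposed = groups_with_open_port(sample_groups)
```

Checking port ranges and the all-protocols rule matters: a rule for ports 0-65535 or protocol -1 exposes SSH just as much as an explicit port 22 rule, and an IPv6 ::/0 source is easy to overlook.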

Question 15

AWS us-east-1 is experiencing a major outage affecting your primary deployment. Your CEO is asking for an ETA on recovery. What do you do in the first hour?

Accepted Answer

Candidate should distinguish between failing over to a secondary region (if a DR architecture exists) and waiting for AWS to recover. First steps: check the AWS Health Dashboard (both the public service status and the account-level view), assess which customers are affected and communicate status honestly, and evaluate whether failover to the secondary region is feasible given RTO/RPO and what manual steps it requires. Strong answers note that if no DR exists, now is the wrong time to design one.

Question 16

A terraform plan on a stable production module suddenly shows 5 resource changes that no one made in code. How do you investigate and decide whether to apply?

Accepted Answer

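Candidate should describe comparing the plan output against recent git commits to rule out code changes, then checking CloudTrail for manual console changes to those resources (drift), and using terraform show on the saved plan to inspect exactly what Terraform is proposing. If drift is confirmed, they should decide whether the manual change is meant to stay: if it is, update the configuration to match and refresh state (for example with terraform apply -refresh-only) so the plan comes back clean; if it is not, apply the plan deliberately to revert it. Blindly applying the plan may destroy a manual change that was made for a good reason.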
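Before deciding whether to apply, it helps to see the plan as data rather than scrolling terminal output. The sketch below assumes the JSON produced by running terraform show -json on a saved plan file, using its resource_changes entries and their change.actions lists; the function name and resource addresses are hypothetical.

```python
import json
from collections import Counter


def summarize_plan(plan_json):
    """Count planned actions and list every address that would be deleted.

    Returns (counts, destructive) where counts maps an action string such
    as "update" or "delete/create" to a number of resources.
    """
    plan = json.loads(plan_json)
    counts = Counter()
    destructive = []
    for change in plan.get("resource_changes", []):
        actions = change["change"]["actions"]
        if actions == ["no-op"]:
            continue  # unchanged resources are listed too; skip them
        counts["/".join(actions)] += 1
        if "delete" in actions:
            destructive.append(change["address"])
    return dict(counts), destructive


# Hypothetical plan with one in-place update and one replacement.
sample_plan = json.dumps({"resource_changes": [
    {"address": "aws_security_group.web", "change": {"actions": ["update"]}},
    {"address": "aws_instance.app", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["no-op"]}},
]})

action_counts, destructive_addresses = summarize_plan(sample_plan)
```

Anything in the destructive list deserves a CloudTrail lookup before an apply is even considered, since a replacement on a stable module is rarely something to accept on trust.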

Question 17

You get paged at 2 AM because Lambda invocations are timing out across multiple functions. P99 duration jumped from 200ms to 30 seconds in the last 15 minutes. What do you do?

Accepted Answer

Candidate should first check whether a downstream dependency is the cause (RDS, DynamoDB, external API) by looking at X-Ray traces to find where time is spent. They should also check whether VPC-attached Lambdas lost connectivity (DNS resolution failure, security group change, subnet IP address exhaustion) and whether a recent deployment changed the functions. Immediate mitigation may be to increase the Lambda timeout temporarily, but the real fix requires identifying the bottleneck.

Question 18

A team launched a new service last month. Their first AWS bill is 10 times the estimate they gave you. They had no budget alerts configured. How do you handle this conversation and what systemic changes do you make?

Accepted Answer

Candidate should describe getting the data first (Cost Explorer by service for that account), understanding what drove the overage (data transfer, NAT Gateway, over-provisioned instances) before the conversation, then presenting findings factually without blame. Systemic changes: mandatory AWS Budgets alerts before any service goes live, FinOps tagging validation in CI/CD, and a cloud cost review item on the launch checklist.
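Tagging validation in CI/CD can be as simple as inspecting the plan before apply. The following is a minimal sketch, assuming a parsed terraform show -json plan in which taggable resources expose a tags map under change.after; the required tag keys and resource addresses are hypothetical and would come from the organisation's own tagging policy.

```python
# Hypothetical policy: every taggable resource must carry these keys.
REQUIRED_TAGS = {"team", "cost-center"}


def missing_tags(plan, required=REQUIRED_TAGS):
    """Map resource address -> sorted list of required tags it lacks."""
    problems = {}
    for change in plan.get("resource_changes", []):
        after = change["change"].get("after")
        if after is None or "tags" not in after:
            continue  # resource is being deleted or has no tags attribute
        tags = after.get("tags") or {}
        missing = sorted(set(required) - set(tags))
        if missing:
            problems[change["address"]] = missing
    return problems


# Hypothetical parsed plan.
sample_plan = {"resource_changes": [
    {"address": "aws_instance.app",
     "change": {"actions": ["create"], "after": {"tags": {"team": "payments"}}}},
    {"address": "aws_s3_bucket.logs",
     "change": {"actions": ["create"],
                "after": {"tags": {"team": "payments", "cost-center": "cc-1"}}}},
    {"address": "aws_iam_role.old",
     "change": {"actions": ["delete"], "after": None}},
]}

violations = missing_tags(sample_plan)
```

A pipeline step could fail the build whenever the returned mapping is non-empty, which makes the untagged-spend problem visible at review time rather than on the first invoice.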

Question 19

A CVE is published for the base AMI used by 50 production EC2 instances. The vendor has released a patched AMI. You need to roll out the patch with no more than 5 minutes of downtime per instance. Walk me through the plan.

Accepted Answer

Candidate should describe building and testing a new AMI with the patch using a golden AMI pipeline (Packer), updating the Launch Template with the new AMI version, and using an Auto Scaling Group instance refresh with MinHealthyPercentage set to 80 so that only a small batch is replaced at a time. For instances not in an ASG, they should use SSM Run Command to apply the patch in place where possible, or replace the instances manually in small batches, mirroring what instance refresh does. A communication plan for stakeholders is also expected.
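The batch arithmetic behind the 80% figure is worth making explicit. This is a rough sketch of the sizing logic only; the function name is hypothetical, and the real Auto Scaling behaviour also depends on warm-up time, health checks, and rounding rules that are not modelled here.

```python
import math


def refresh_batches(instance_count, min_healthy_percentage):
    """Estimate batch size and batch count for a rolling replacement.

    Returns (batch_size, batches). At least one instance is always
    replaced per batch so the rollout can make progress.
    """
    must_stay_healthy = math.ceil(instance_count * min_healthy_percentage / 100)
    batch_size = max(1, instance_count - must_stay_healthy)
    batches = math.ceil(instance_count / batch_size)
    return batch_size, batches


# The scenario from the question: 50 instances, 80% kept healthy.
batch_size, batches = refresh_batches(50, 80)
```

With 50 instances and 80% kept healthy, 40 must stay in service, so roughly 10 are replaced at a time across 5 batches. Raising the percentage to 90 halves the batch size and doubles the number of batches, which is the trade-off between rollout speed and remaining capacity.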

Questions

Tell me about a production incident you owned end-to-end (Behavioural, medium, Very common)

Describe a time you significantly reduced cloud spend (Behavioural, medium, Common)

Describe a security versus delivery speed conflict you navigated (Behavioural, medium, Common)

Tell me about a large-scale migration project you led (Behavioural, medium, Common)

Tell me about influencing IaC standards across multiple teams (Behavioural, medium, Occasional)

Describe improving an on-call experience that was burning people out (Behavioural, medium, Common)

Tell me about pushing back on a flawed architecture decision (Behavioural, medium, Common)

Tell me about building something that made other engineers more productive (Behavioural, medium, Common)

Describe prioritising cloud infrastructure technical debt (Behavioural, medium, Common)

Tell me about evaluating and choosing between two cloud services (Behavioural, easy, Common)

Tell me about preparing cloud infrastructure for a compliance audit (Behavioural, medium, Occasional)

What would you do if AWS spend suddenly doubled overnight? (Behavioural, medium, Common)

What would you do if Terraform state became corrupted in production? (Behavioural, hard, Occasional)

What would you do if you discovered 0.0.0.0/0 on port 22 in production? (Behavioural, medium, Common)

How would you respond if an AWS region becomes unavailable? (Behavioural, hard, Occasional)

What do you do when terraform plan shows unexpected resource changes? (Behavioural, medium, Common)

Lambda functions are timing out suddenly. What is your approach? (Behavioural, hard, Common)

A new team's first month bill is 10x the estimate. What do you do? (Behavioural, medium, Common)

How do you handle 50 EC2 instances running an AMI with a critical CVE? (Behavioural, hard, Occasional)
