Question 1

You inherit a cloud estate where every team uses admin-like roles in a shared account. How would you redesign access without blocking delivery?

Accepted Answer

Start by separating environments and blast radii: production from non-production, shared platform from application accounts, and human access from workload identity. Define roles around actions people actually need, then enforce least privilege with permission boundaries, short-lived credentials, and break-glass access that is audited. Use infrastructure-as-code and policy-as-code so access changes are reviewed, repeatable, and testable. Strong answers mention migration sequencing because cutting everyone over in one move usually breaks deployments. Candidates often trip up by describing perfect IAM theory without a path from the current unsafe state.

Question 2

Walk me through exactly what happens inside the Kubernetes scheduler from the moment a pod is created until it is bound to a node. What are the two main phases, what plugins run in each, and where would you intervene if pods were stuck in Pending?

Accepted Answer

Strong answers cover the two-phase model inside the scheduling cycle: filtering (formerly predicates) removes ineligible nodes, and scoring (formerly priorities) ranks the survivors; the chosen node is then bound in a separate binding cycle. The scheduler uses a plugin framework, so candidates should name real plugins such as NodeResourcesFit, PodTopologySpread, TaintToleration, and InterPodAffinity. For Pending pods they would describe the pod and read its Events, inspect scheduler logs, and check whether insufficient allocatable resources, untolerated taints, or node selectors and affinity rules that match no node are blocking assignment.
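
The filter-then-score flow can be illustrated with a toy model. This is a simplified sketch, not the scheduler's actual code: the Node and Pod classes, the two filter checks, and the "most CPU left" scoring rule are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu: float  # allocatable CPU cores not yet requested
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_request: float
    tolerations: set = field(default_factory=set)


def filter_nodes(pod, nodes):
    """Filtering phase: drop nodes that cannot run the pod at all."""
    return [
        n for n in nodes
        if n.free_cpu >= pod.cpu_request   # resource-fit style check
        and n.taints <= pod.tolerations    # every taint must be tolerated
    ]


def score_node(pod, node):
    """Scoring phase: prefer the node with the most CPU left after placement."""
    return node.free_cpu - pod.cpu_request


def schedule(pod, nodes):
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None  # the pod would stay Pending
    return max(feasible, key=lambda n: score_node(pod, n)).name
```

A pod that returns None here corresponds to the Pending case: no node survived filtering, so the Events on the pod are the place to look for which check failed.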

Question 3

Explain how Terraform implements state locking when using an S3 backend with DynamoDB. What happens if a lock is never released after a crash, how do you safely recover, and what design choices reduce the blast radius of a corrupted state file?

Accepted Answer

Candidates should describe that Terraform writes a lock entry to a DynamoDB table before any operation that could write state and deletes it when the operation finishes. If the process crashes, the lock remains; recovery means first confirming that no other run is still active, then running terraform force-unlock with the lock ID shown in the lock error. To reduce blast radius: separate state files per environment and per component so that corruption in one area does not affect others, enable S3 versioning so a bad state file can be rolled back, and use workspaces or directory-per-env patterns.
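
The lock lifecycle described above can be modelled in a few lines. This is a toy in-memory sketch, not Terraform's backend code: the LockTable class, its method names, and the state path used in the usage note are hypothetical.

```python
import uuid


class LockTable:
    """Toy stand-in for the lock table: one lock entry per state path."""

    def __init__(self):
        self._locks = {}

    def acquire(self, state_path, who):
        # Write the entry only if none exists, mimicking a conditional put.
        if state_path in self._locks:
            raise RuntimeError(f"state locked by {self._locks[state_path]['who']}")
        lock_id = str(uuid.uuid4())
        self._locks[state_path] = {"id": lock_id, "who": who}
        return lock_id

    def release(self, state_path, lock_id):
        # Normal path: the holder deletes its own entry when the run ends.
        if self._locks.get(state_path, {}).get("id") == lock_id:
            del self._locks[state_path]

    def force_unlock(self, state_path, lock_id):
        # Recovery path: the operator must supply the ID of the stale lock.
        entry = self._locks.get(state_path)
        if entry is None or entry["id"] != lock_id:
            raise ValueError("lock ID does not match the current lock")
        del self._locks[state_path]
```

If a run acquires the lock on a path such as "prod/network.tfstate" and then crashes, release is never called, every later acquire fails, and only force_unlock with the matching ID clears the entry.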

Question 4

We have 15 product teams sharing a single cluster. Each team owns a set of namespaces. Developers should deploy workloads but not touch cluster-level resources. Platform engineers need cluster-wide read access. SREs need the ability to restart pods anywhere but not modify RBAC. Sketch the Role, ClusterRole, and binding strategy.

Accepted Answer

Strong answers distinguish Roles (namespaced) from ClusterRoles (cluster-wide) and show that a ClusterRole can be bound at namespace scope with a RoleBinding. Developers get a ClusterRole with the verbs needed to deploy (get/list/watch/create/update/patch/delete on pods, deployments, services, configmaps), bound per namespace via RoleBinding. Platform engineers get a ClusterRole with read-only verbs on all resources via ClusterRoleBinding. SREs get a ClusterRole with patch/delete on pods only, bound cluster-wide via ClusterRoleBinding, with no rules covering RBAC resources.
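
The "one ClusterRole, many RoleBindings" pattern for developers might be generated like this. It is a minimal sketch that only builds manifests as dictionaries; the role name, group name, namespace, and exact verb list are illustrative assumptions.

```python
import json


def developer_cluster_role():
    """One ClusterRole reused in every team namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": "developer"},
        "rules": [
            {
                "apiGroups": ["", "apps"],
                "resources": ["pods", "deployments", "services", "configmaps"],
                "verbs": ["get", "list", "watch", "create", "update", "patch", "delete"],
            }
        ],
    }


def team_role_binding(namespace, group):
    """RoleBinding: grants the ClusterRole only inside one namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "developer", "namespace": namespace},
        "subjects": [
            {"kind": "Group", "name": group, "apiGroup": "rbac.authorization.k8s.io"}
        ],
        "roleRef": {
            "kind": "ClusterRole",
            "name": "developer",
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }


if __name__ == "__main__":
    # Hypothetical team and namespace names.
    print(json.dumps(team_role_binding("team-a-dev", "team-a"), indent=2))
```

Because the binding is a RoleBinding rather than a ClusterRoleBinding, the same role definition never grants anything outside the namespace named in metadata.namespace.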

Question 5

A team checks in a 2.4 GB Node.js Docker image. You need to get it below 300 MB and reduce CVE exposure. Walk through the specific changes you would make to the Dockerfile, base image choice, and build pipeline to achieve this.

Accepted Answer

Switch from node:18 to node:18-alpine or a distroless base. Use multi-stage builds to separate the build stage (with devDependencies) from the runtime stage (production deps only). Combine RUN commands to avoid layer bloat, use .dockerignore to exclude node_modules and test files from the build context, and add npm ci --omit=dev (the --only=production flag was deprecated in npm v7). For distroless, copy just the app files and node_modules into the final stage. Add Trivy or Grype in CI to gate on critical CVEs.

Question 6

Your tracing bill is too high, but engineers rely on traces to debug rare checkout failures. How would you change the sampling strategy?

Accepted Answer

Separate head sampling, which decides early and is cheap, from tail sampling, which can keep traces based on outcome after the request completes. Keep all error traces, slow traces, and traces for important business flows, while sampling routine successful traffic at a lower rate. Propagate sampling decisions consistently across services so one request does not produce broken partial traces. Use attributes with bounded cardinality to drive retention rules, and validate with engineers that rare failure modes remain findable. A weak answer simply lowers the global sampling percentage and loses the exact traces people need during incidents.
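
A tail-sampling rule along these lines could look like the sketch below. The thresholds, the "checkout" flow name, and the 5% base rate are assumptions for illustration, and the function is not tied to any particular collector.

```python
import hashlib


def keep_trace(trace_id, has_error, duration_ms, flow,
               slow_ms=2000, important_flows=("checkout",), base_rate=0.05):
    """Tail-sampling decision made after the whole trace has been seen."""
    if has_error or duration_ms >= slow_ms or flow in important_flows:
        return True
    # Deterministic sampling: hash the trace ID into [0, 1) so every
    # collector instance makes the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < base_rate
```

Hashing the trace ID rather than calling a random number generator is what keeps the decision consistent across services and collector replicas, which avoids partial traces.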

Question 7

A cluster starts throwing 'etcdserver: mvcc: database space exceeded' errors. Explain what caused this, what compaction and defragmentation do differently, and what steps you take to recover without downtime.

Accepted Answer

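Candidates should explain etcd's MVCC model and how old revisions accumulate until compacted, eventually hitting the backend quota (--quota-backend-bytes) and raising a NOSPACE alarm, after which the cluster accepts only reads and deletes. Compaction removes old key revisions from the logical store, but the on-disk bbolt file stays large until defragmentation physically reclaims the space. Recovery involves compacting to the current revision with etcdctl compact, running etcdctl defrag on each member one at a time to avoid quorum loss, clearing the alarm with etcdctl alarm disarm, and setting --auto-compaction-retention to prevent recurrence.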
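
The difference between compaction and defragmentation is easier to see in a toy model. The class below is an illustrative sketch only: it counts one "page" per stored revision, which is a deliberate simplification of how bbolt actually lays out data.

```python
class ToyMVCCStore:
    """Toy model: every write appends a revision; nothing is overwritten."""

    def __init__(self):
        self.revision = 0
        self.history = []    # (revision, key, value) entries still readable
        self.file_pages = 0  # pages allocated in the backing file

    def put(self, key, value):
        self.revision += 1
        self.history.append((self.revision, key, value))
        # The file only grows when no free pages are left to reuse.
        self.file_pages = max(self.file_pages, len(self.history))

    def compact(self, revision):
        # Drop superseded revisions at or below `revision`; keep the newest
        # entry per key. Freed pages stay allocated in the file.
        latest = {k: r for r, k, _ in self.history}
        self.history = [
            (r, k, v) for r, k, v in self.history
            if r > revision or latest[k] == r
        ]

    def defrag(self):
        # Rewrite the file so it only holds pages that are still in use.
        self.file_pages = len(self.history)
```

After six writes and a compaction, history shrinks to the live keys while file_pages is unchanged; only defrag brings the file size down, which mirrors why both steps are needed during recovery.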

Question 8

We are running Cilium as our CNI. Walk me through the full packet path when pod A on node 1 sends a TCP packet to pod B on node 2. Where does eBPF intercept the packet, what is the role of the BPF map, and how does this differ from the path when using Calico with IPIP encapsulation?

Accepted Answer

A strong answer traces the packet from the veth pair inside the pod namespace, through the TC eBPF hook on the host-side veth, into the Cilium datapath, which does an endpoint map lookup and then either routes directly via native routing or encapsulates with VXLAN/Geneve. Compared to Calico IPIP, which wraps every cross-node packet in an outer IP header and typically enforces policy with iptables, Cilium in native-routing mode avoids the encapsulation overhead and makes its forwarding and policy decisions in eBPF. Candidates should mention BPF_MAP_TYPE_HASH maps for policy enforcement and the CT (connection tracking) map.

Question 9

Our security team says every namespace must declare a LimitRange and ResourceQuota before any workload can be admitted. How would you implement this using OPA Gatekeeper? Describe the ConstraintTemplate structure, the Rego policy logic, and how you would test it before rolling it out to production clusters.

Accepted Answer

Candidates should describe a ConstraintTemplate whose Rego lives under spec.targets and receives the admission request as input.review, checks for the existence of LimitRange and ResourceQuota objects in the namespace via the replicated cache (data.inventory, which requires syncing those kinds through the Gatekeeper Config resource), and reports a violation if they are absent. Testing involves the gator CLI (gator verify for test suites, gator test against sample manifests) before applying to a staging cluster. They should also mention constraint enforcement actions: deny vs warn vs dryrun.

Question 10

We have a queue-processing service and want HPA to scale based on the number of unprocessed messages in a RabbitMQ queue, exposed via Prometheus. Walk through the full pipeline: how the metric gets from RabbitMQ into Kubernetes HPA, what components are involved, and what happens when the metric server is unavailable.

Accepted Answer

Strong answers trace the path: RabbitMQ exposes queue metrics through its Prometheus plugin or an exporter, Prometheus scrapes them, KEDA or the Prometheus Adapter serves them through the external.metrics.k8s.io API, and HPA queries that API. With KEDA the candidate describes ScaledObject and TriggerAuthentication. When the component serving the metrics API (the adapter or KEDA's metrics server) is unavailable, HPA stops scaling in both directions and sets a condition indicating metrics are unavailable. It does not fall back silently; it preserves the current replica count until metrics become available again, which is why availability of the adapter matters.
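
The replica calculation HPA performs on the queue-length metric can be sketched as follows. The formula follows the documented ceil(currentReplicas * currentValue / targetValue) rule for an average-value target; the min/max bounds and the use of None for "metric unavailable" are assumptions of this sketch, and stabilization windows and scaling policies are left out.

```python
import math


def desired_replicas(current_replicas, queue_length, target_per_pod,
                     min_replicas=1, max_replicas=50, tolerance=0.1):
    """Replica count for an external metric with an average-value target."""
    if queue_length is None:
        # Metric unavailable: keep the current replica count.
        return current_replicas
    current_per_pod = queue_length / current_replicas
    ratio = current_per_pod / target_per_pod
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # within tolerance, no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas, 1000 queued messages, and a target of 100 messages per pod, the ratio is 2.5 and the function asks for 10 replicas.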

Question 11

We use Flux v2 with GitRepository, Kustomization, and HelmRelease objects. Walk me through the reconciliation loop for a HelmRelease: what controllers are involved, what happens when a Helm chart value changes in Git, and how do you debug a HelmRelease stuck in a failed state?

Accepted Answer

Candidates describe the source-controller fetching the Git repo and producing an Artifact, then the helm-controller watching HelmRelease objects, requesting a HelmChart that the source-controller packages from that Artifact, and running helm upgrade when the chart or its values change. On failure the HelmRelease status condition shows the error and the controller retries according to the remediation settings configured on the HelmRelease. Debugging involves kubectl get helmrelease -o yaml to read status.conditions, flux logs --kind=HelmRelease, and looking at Helm release history with helm history.

Question 12

Explain the scale-up and scale-down logic in cluster autoscaler. What triggers a scale-up? What safeguards prevent aggressive scale-down from evicting critical workloads? We have seen nodes stay at 60% utilization but never scale down. Walk through the likely causes.

Accepted Answer

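Scale-up triggers when a pod is unschedulable and CA's simulation shows that a new node from one of the node groups would fit it. Scale-down considers a node when the sum of its pods' requests stays below the scale-down-utilization-threshold (default 50%) for a sustained period and its pods can be rescheduled elsewhere. A node sitting at 60% requested utilization is therefore above the default threshold and never becomes a candidate in the first place. Other common reasons a node never scales down: PodDisruptionBudgets that block eviction, pods using local storage such as emptyDir, pods not managed by a controller, and pods annotated as not safe to evict. CA works from requests rather than actual usage, so inflated requests keep a node looking busy; DaemonSet pods do not block scale-down.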
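
The utilization check behind scale-down can be expressed as a small calculation. This is a sketch under stated assumptions: it takes the larger of the CPU and memory request ratios and compares it with the 50% default threshold, and it ignores every other eligibility rule.

```python
def scale_down_candidate(cpu_requested, cpu_allocatable,
                         mem_requested, mem_allocatable,
                         threshold=0.5):
    """Return (utilization, is_candidate) for one node.

    Utilization is based on pod requests, not live usage; this sketch
    takes the larger of the CPU and memory ratios.
    """
    utilization = max(cpu_requested / cpu_allocatable,
                      mem_requested / mem_allocatable)
    return utilization, utilization < threshold
```

A node with 2.4 of 4 cores requested is at 60% and is not a candidate under the default threshold, regardless of how idle its CPUs actually are.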

Question 13

A service mesh migration introduced NetworkPolicies and now a frontend pod cannot reach a backend pod in the same namespace. Ingress and egress are enabled. How do you systematically isolate whether the issue is a NetworkPolicy, DNS, a service selector mismatch, or something at the CNI level?

Accepted Answer

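Start by exec-ing into the frontend pod and running curl against the backend's ClusterIP, then against its Service DNS name, to separate DNS problems from connectivity problems. If an egress policy applies to the frontend, confirm it still allows DNS traffic to the cluster DNS pods on port 53. Check NetworkPolicy with kubectl get networkpolicy -n <ns> -o yaml and verify that the podSelector and namespaceSelector labels match the actual pod labels. Verify that the Service selector matches pod labels with kubectl get endpoints; an empty endpoints list points to a selector mismatch. Use CNI-specific tooling (for example cilium monitor --type drop on Cilium, or the policy and flow logging of your Calico installation) to trace dropped packets.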

Question 14

We use a pre-upgrade Helm hook to run database migrations before a new chart version deploys. Explain exactly when the hook job runs relative to the main manifests, what happens to the job after it completes, and what failure scenarios can leave a release stuck in a pending-upgrade state.

Accepted Answer

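Pre-upgrade hooks run after Helm marks the release as pending-upgrade but before any non-hook resources are applied. The hook job is deleted according to its helm.sh/hook-delete-policy annotation (before-hook-creation by default, or hook-succeeded or hook-failed). If the job fails, the release is marked failed; if the policy does not delete the old job, the next upgrade attempt fails because the job already exists. A release is left in pending-upgrade when the Helm process is interrupted mid-upgrade (for example a CI job killed or timed out while waiting on the hook), and later upgrades are refused because another operation appears to be in progress. Stuck releases need helm rollback or manual kubectl delete of the hook job.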

Question 15

We deploy a stack that requires a Namespace, then a Secret from an external-secrets operator, then a Deployment that reads that secret. Using Argo CD sync waves, how do you guarantee this ordering? What are the limits of sync waves, and what do you use when waves are insufficient?

Accepted Answer

Sync waves use the argocd.argoproj.io/sync-wave annotation with integer values; Argo CD applies resources in wave order and waits for each wave to be healthy before proceeding. The Namespace gets wave 0, the ExternalSecret gets wave 1, the Deployment gets wave 2. Limits: a wave does not wait for a custom resource such as the ExternalSecret to be ready unless a health check is defined for that kind. For finer control, sync hooks (PreSync, Sync, PostSync) or resource health checks with custom Lua scripts fill the gap.
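
The ordering rule can be sketched as a sort on the annotation value. This is an illustration of the wave grouping only, not Argo CD's implementation; health checks and hook phases are left out, and resources without the annotation are treated as wave 0.

```python
from itertools import groupby

WAVE = "argocd.argoproj.io/sync-wave"


def wave_of(resource):
    # Resources without the annotation default to wave 0.
    annotations = resource.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE, "0"))


def plan_waves(resources):
    """Group resources into waves, lowest wave first."""
    ordered = sorted(resources, key=wave_of)
    return [
        (wave, [r["kind"] for r in group])
        for wave, group in groupby(ordered, key=wave_of)
    ]
```

Each tuple in the result is one wave; in the real controller the next wave starts only after everything in the previous one reports healthy.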

Question 16

We have a Kafka cluster with 3 brokers running as a StatefulSet. Explain how you would configure PodDisruptionBudgets to allow cluster autoscaler to scale down nodes while keeping the Kafka cluster available during voluntary disruptions. What value do you choose for minAvailable or maxUnavailable, and why?

Accepted Answer

For a 3-broker Kafka cluster with replication factor 3 and min.insync.replicas=2, producers using acks=all need at least 2 brokers up, so minAvailable: 2 (equivalently maxUnavailable: 1) is the right choice. This lets cluster autoscaler evict one broker pod at a time. The PDB selector must match the StatefulSet pods exactly. Candidates should note that a PDB only covers voluntary disruptions, and that pod anti-affinity should spread the brokers across different nodes: if two brokers share a node, draining it would need to evict both, which the PDB blocks, so that node can never be drained.
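
The arithmetic behind the PDB decision is small enough to write out. The function below is a sketch of the disruptions-allowed calculation for integer values only; percentage values and the controller's handling of unhealthy pods are not modelled.

```python
def disruptions_allowed(healthy_pods, min_available=None, max_unavailable=None,
                        desired_pods=None):
    """How many pods may be evicted right now under a PodDisruptionBudget."""
    if min_available is not None:
        allowed = healthy_pods - min_available
    else:
        # maxUnavailable is measured against the desired replica count.
        allowed = max_unavailable - (desired_pods - healthy_pods)
    return max(0, allowed)
```

With 3 healthy brokers and minAvailable: 2 one eviction is allowed; once a broker is down the budget drops to zero, so the autoscaler has to wait for it to come back before draining the next node.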

Question 17

Describe two different patterns for getting a Vault-managed secret into a Kubernetes pod: the Vault Agent Injector sidecar approach and the External Secrets Operator approach. What are the operational tradeoffs, and which would you recommend for a new platform?

Accepted Answer

Vault Agent Injector uses a mutating webhook to inject a sidecar that authenticates via Kubernetes service account, fetches secrets, writes them to a shared in-memory volume, and keeps them rotated. External Secrets Operator creates Kubernetes Secret objects from Vault, which means secrets are persisted in etcd (encrypted at rest if configured) and native tooling like envFrom works without sidecar overhead. ESO is simpler operationally but trades the zero-persistence property of the sidecar approach.

Question 18

We have a Prometheus that is using 120 GB of memory and scrape intervals are timing out. You investigate and find that one team is exposing a metric with a user_id label that has 2 million unique values. Explain what high cardinality means, why it is so destructive, and what you do to fix it both short-term and long-term.

Accepted Answer

Each unique label combination creates a separate time series in the TSDB, so a user_id label with 2M values generates at least 2M series for that one metric, multiplied by the values of every other label on it. Short-term: add metric_relabel_configs rules to drop the label or the metric entirely. Long-term: set limits in the scrape config (such as sample_limit), use recording rules to pre-aggregate, move per-user detail to logs or traces (Loki, Tempo), and implement metric review gates in the golden path CI pipeline. Mimir or Thanos with per-tenant cardinality limits also help.
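
The multiplication effect is worth making concrete. In this sketch the label names other than user_id and their value counts are hypothetical, and the product is an upper bound because not every combination necessarily occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: product of label value counts."""
    return prod(label_cardinalities.values())


def drop_label(label_cardinalities, label):
    """Effect of dropping one label at scrape time (for example via relabeling)."""
    return {k: v for k, v in label_cardinalities.items() if k != label}


if __name__ == "__main__":
    labels = {"user_id": 2_000_000, "method": 5, "status": 4}
    print(series_count(labels))                          # 40000000
    print(series_count(drop_label(labels, "user_id")))   # 20
```

Dropping the one unbounded label takes the worst case from tens of millions of series to a couple of dozen, which is why the short-term fix targets the label rather than the scrape interval.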

Question 19

We have 20 product teams on a shared cluster with a total of 500 CPU cores and 2 TB RAM. How do you implement ResourceQuotas, LimitRanges, and priority classes so that no single team can starve others, burst capacity is available for urgent workloads, and platform components (ingress, monitoring) are protected?

Accepted Answer

Set per-namespace ResourceQuotas that size each team's baseline plus a burst allowance. LimitRanges supply default requests and limits so that pods without explicit resources are still admitted and counted against the quota. Use three priority tiers: the built-in system-cluster-critical class for platform components (it preempts others), a high class for urgent team workloads, and a normal class for steady-state work. A ResourceQuota can be scoped by PriorityClass so that bursting teams cannot consume the system-critical priority pool. Candidates should mention the Hierarchical Namespace Controller for nested quota inheritance.
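
A first-pass sizing for the 500-core, 20-team cluster might be worked out like this. The 10% platform reservation and 20% shared burst pool are assumed figures for illustration, not recommendations from the answer above.

```python
def quota_plan(total_cpu, teams, platform_share=0.10, burst_share=0.20):
    """Split cluster CPU into a platform reserve, a burst pool, and team baselines."""
    platform = total_cpu * platform_share
    burst_pool = total_cpu * burst_share
    baseline = (total_cpu - platform - burst_pool) / teams
    return {
        "platform": platform,
        "burst_pool": burst_pool,
        "baseline_per_team": baseline,
    }


if __name__ == "__main__":
    print(quota_plan(500, 20))
```

Under these assumptions each team gets a 17.5-core baseline, with 100 cores held back for bursts and 50 for ingress, monitoring, and other platform components.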

Question 20

A team wants to roll out a new version of their service to 5% of traffic, monitor error rate, and automatically promote or rollback based on a Prometheus metric. Walk through the Istio VirtualService and DestinationRule configuration, the automation layer you would build, and how Argo Rollouts fits into this.

Accepted Answer

A DestinationRule defines two subsets (stable and canary) using pod label selectors. A VirtualService splits traffic by weight (95/5) between the subsets. Argo Rollouts replaces the standard Deployment with a Rollout object whose canary strategy integrates with Istio through its trafficRouting.istio settings, which reference the VirtualService and DestinationRule that the controller rewrites. The canary steps use setWeight to increase canary traffic, and AnalysisTemplates query Prometheus for the error rate metric so the rollout is promoted automatically on success or aborted and rolled back on failure.
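
The promote-or-rollback loop can be reduced to a few lines. This sketch stands in for the automation layer only: the step weights, the 1% error threshold, and the list of observed error rates (in place of real Prometheus queries) are assumptions.

```python
def run_canary(error_rates, steps=(5, 25, 50, 100), max_error_rate=0.01):
    """Walk through traffic weights; abort on the first failed analysis.

    `error_rates` holds one observed canary error rate per step, standing
    in for a Prometheus query result.
    """
    weight = 0
    for step, observed in zip(steps, error_rates):
        weight = step             # shift this share of traffic to the canary
        if observed > max_error_rate:
            return "rollback", 0  # send all traffic back to stable
    return ("promoted", weight) if weight == 100 else ("paused", weight)
```

Each loop iteration corresponds to a weight change followed by an analysis run; a failed analysis at any step sends the canary weight back to zero.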

Question 21

We have a cluster with three node groups: on-demand general, spot general, and on-demand GPU. Describe how you use taints, tolerations, nodeSelectors, and nodeAffinity together to ensure GPU workloads run only on GPU nodes, spot-tolerant batch jobs prefer spot but fall back to on-demand, and stateful services never land on spot.

Accepted Answer

Taint GPU nodes with nvidia.com/gpu=present:NoSchedule so only pods with the matching toleration can land there. Taint spot nodes with spot=true:NoSchedule. Spot-tolerant batch jobs add the spot toleration and use node affinity with preferredDuringSchedulingIgnoredDuringExecution and a weight to prefer spot nodes without requiring them. Stateful services simply do not have the spot toleration, so the NoSchedule taint keeps them off spot. A toleration only permits placement and does not attract pods, so GPU workloads also need a nodeSelector or node affinity with requiredDuringSchedulingIgnoredDuringExecution to pin them to the GPU pool.
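
How the three mechanisms combine can be checked with a toy placement function. This is a simplified sketch: taints and tolerations are compared as plain strings, the "lifecycle" and "accelerator" label keys are hypothetical, and only one preferred rule is supported.

```python
def tolerates(pod_tolerations, node_taints):
    """A NoSchedule taint blocks the pod unless a toleration matches it."""
    return all(t in pod_tolerations for t in node_taints)


def pick_node(pod, nodes):
    # Required rules: tolerated taints and matching nodeSelector labels.
    feasible = [
        n for n in nodes
        if tolerates(pod.get("tolerations", []), n.get("taints", []))
        and all(n["labels"].get(k) == v
                for k, v in pod.get("nodeSelector", {}).items())
    ]
    if not feasible:
        return None

    # Preferred rule: add the weight when the node carries the preferred label.
    pref = pod.get("prefer", {})

    def score(n):
        if pref and n["labels"].get(pref["label"][0]) == pref["label"][1]:
            return pref["weight"]
        return 0

    return max(feasible, key=score)["name"]
```

A batch pod that tolerates the spot taint and prefers lifecycle=spot lands on a spot node when one exists and falls back to on-demand otherwise, while a pod with no tolerations can only ever reach the untainted on-demand pool.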

Question 22

We want all teams to use a standard CI pipeline that handles building, scanning, testing, and deploying container images. The pipeline must be opinionated enough to enforce security gates but flexible enough for teams with different tech stacks. How do you design this? What is your templating strategy, and how do you roll out updates to 50 pipelines without breaking everyone?

Accepted Answer

Use pipeline-as-code templates (GitHub Actions reusable workflows, GitLab CI includes, or Tekton pipelines via a catalog). Teams reference the template at a pinned version tag, consuming it like a library. The template encodes required stages (SAST via Semgrep, image build, Trivy scan, push, deploy) and exposes extension points via inputs. Updates follow a deprecation cycle: the new version is released, teams get a migration window, and automated PRs update their version pins. Renovate or internal tooling automates the pin upgrades.

Question 23

A StatefulSet is stuck because its PVCs are in Pending state. Describe your debugging steps: how do you distinguish between a missing StorageClass, a CSI driver issue, a node without the volume plugin installed, and a capacity problem in the underlying storage backend?

Accepted Answer

Start with kubectl describe pvc to read the events: a missing StorageClass shows an event saying the class was not found, a CSI driver issue shows provisioning errors that can be followed up in the CSI controller pod logs, and a node problem shows up at attach or mount time and requires checking the DaemonSet for the CSI node driver. A class with volumeBindingMode WaitForFirstConsumer leaves the PVC Pending until a pod that uses it is scheduled, which is expected rather than a fault. For EBS on AWS, check for zone mismatches between the volume and available nodes. Use kubectl get csinode to verify that the driver is registered on each node and which topology keys it reports. Storage capacity exhaustion on the backend appears as provisioning failures in the external-provisioner logs.

Question 24

How do you run OPA policies against Kubernetes YAML in a CI pipeline before anything reaches the cluster? Describe the conftest setup, how you write Rego tests for the policies themselves, and how you handle policies that need to query cluster state (like checking whether an image tag already exists in the registry).

Accepted Answer

Conftest reads Kubernetes manifests and evaluates them against Rego policies in a policy/ directory, failing the CI job if any deny rule fires. Rego tests live alongside the policies as test_-prefixed rules with mock input fixtures and run with conftest verify. For policies that need cluster state (image existence, quota headroom), the CI job must call the relevant API (registry API for tag checks, Kubernetes API for quota) and inject the result as a conftest data document using --data. This keeps policies pure but requires CI to have appropriate read permissions.

Question 25

We are standardizing on OpenTelemetry and need all 50 services to send traces, metrics, and logs to a central backend. Describe how you deploy and configure the OTel Collector as a DaemonSet sidecar versus a deployment, what processors you include in the pipeline, and how you handle backpressure when the backend is slow.

Accepted Answer

A DaemonSet collector per node handles log collection efficiently and avoids inter-node traffic for traces and metrics. A gateway deployment in front of the backend handles batching, sampling, and routing to multiple backends. The pipeline includes the memory_limiter processor first to prevent OOM, then an attributes processor to add cluster and environment metadata, then batch for efficiency. Backpressure is handled through the exporter's sending_queue and retry_on_failure settings, which buffer and retry while the backend is slow; when memory pressure builds, the memory_limiter refuses new data so that senders retry instead of the collector crashing.

Question 26

You discover that a team has been deploying to production by applying raw YAML directly via kubectl, bypassing your GitOps workflow, Gatekeeper policies, and the image signing requirement. The team says it was faster in an emergency. How do you handle this?

Accepted Answer

Short-term: have a direct, blameless conversation with the team to understand the emergency context. Assess whether the direct apply left the cluster in a state that diverges from Git (it likely did) and reconcile it. Medium-term: address the root cause of why the emergency bypass was needed, which is usually that the GitOps path was too slow or had too much friction. Long-term: tighten controls so that direct kubectl apply to prod from developer identities is not possible (RBAC denying direct writes, with an audited break-glass path for real emergencies), while making the approved path fast enough that bypassing it is not tempting.

Question 27

At 2am on a Wednesday, you get paged because three product teams report high latency. You discover that one team ran a batch job that consumed all available CPU on several nodes, and other teams' pods are throttled. The offending job is business-critical and cannot be killed. What do you do right now, and what do you change afterwards?

Accepted Answer

Immediate: check whether affected pods can be rescheduled to less-loaded nodes (cordon the noisy nodes, but only if other nodes have capacity). Check whether the batch job runs with CPU requests and limits at all, and whether the affected pods are being throttled by their own limits or starved by the batch job's usage. If the batch job is unbounded, negotiate with the team to add a limit or pause it briefly rather than killing it. After the incident: add the ResourceQuotas and LimitRanges that should have prevented this, add monitoring for per-namespace CPU consumption with alerts before saturation, and discuss priority classes to protect critical workloads.

Question 28

Your security team flags a critical CVE in the base image used by all 200 services running on the platform. You have 48 hours to patch it per your SLA. The fix requires bumping the base image tag, rebuilding all 200 images, and redeploying. How do you execute this, and what does this event reveal about your platform's maturity?

Accepted Answer

If 200 services need manual changes, this event reveals that the platform is not mature enough. The target state is one in which a single base image change in the golden-path Dockerfile triggers an automatic rebuild and redeploy of all services. For the immediate 48-hour window: script the base image update across all repos (a Renovate or Dependabot PR, or a bulk git operation), trigger parallel CI builds, and use a phased rollout by environment. The event becomes a forcing function to build automated base image update pipelines (Renovate plus auto-merge for patch updates after the scan passes).

Question 29

A developer runs terraform plan and sees that a change to a security group rule will trigger a replacement of a production RDS instance, which would cause 15 minutes of downtime. They do not know why and come to you. How do you diagnose the root cause and what options do you give them to make the change without the replacement?

Accepted Answer

Check the plan output carefully for the 'forces replacement' note and identify which attribute triggers it. For an RDS instance, it is often a change to an attribute marked as ForceNew in the provider schema (like engine version, subnet group, or parameter group in some providers). Options: use lifecycle ignore_changes to manage the attribute outside Terraform, perform a blue-green replacement manually and import the new resource, or check if a newer provider version handles the change in-place. Use terraform state show to compare current state to the desired state.

Question 30

You are upgrading from Kubernetes 1.24 to 1.25, which removes several beta API versions including policy/v1beta1 PodDisruptionBudget and batch/v1beta1 CronJob. A popular vendor Helm chart your organization uses still references one of these removed APIs and has not released a fix. The upgrade is scheduled for next week. What are your options?

Accepted Answer

Options in order of preference: (1) Check whether the vendor has a release candidate or a community fork that fixes it, and test it now. (2) Fork the chart internally, patch the API version, and pin to the forked version. (3) Use a Helm post-renderer to rewrite the API version at install time without forking, and run the mapkubeapis plugin so that release metadata already stored in the cluster no longer references the removed API. (4) Delay the upgrade by one week while working the vendor relationship. Option 3 is a good short-term bridge. Whatever you choose, document it in the upgrade runbook and add a dependency check to the upgrade prerequisites checklist.

Questions

Design IAM boundaries for a multi-team cloud account (Role-specific, medium, very common)

How does the Kubernetes scheduler assign pods to nodes? (Role-specific, hard, very common)

How does Terraform state locking work and what breaks it? (Role-specific, medium, very common)

Design RBAC for a multi-team Kubernetes cluster (Role-specific, medium, very common)

Minimize container image size and attack surface (Role-specific, easy, very common)

Choose a distributed tracing sampling strategy (Role-specific, medium, common)

Why does etcd compaction matter for cluster health? (Role-specific, hard, common)

Trace a packet from pod A to pod B on different nodes (Role-specific, hard, common)

Enforce namespace resource quotas via OPA Gatekeeper (Role-specific, medium, common)

Configure HPA to scale on a custom Prometheus metric (Role-specific, medium, common)

Explain the Flux reconciliation loop and failure modes (Role-specific, medium, common)

How does cluster autoscaler decide to add or remove nodes? (Role-specific, medium, common)

Debug a broken NetworkPolicy blocking legitimate traffic (Role-specific, medium, common)

How do Helm hooks work and when do they fire? (Role-specific, medium, common)

Use Argo CD sync waves to order resource creation (Role-specific, medium, common)

Design PodDisruptionBudgets for a stateful application (Role-specific, medium, common)

Integrate HashiCorp Vault with Kubernetes workloads (Role-specific, medium, common)

High cardinality metrics are killing your Prometheus (Role-specific, hard, common)

Implement fair resource allocation for 20 teams on one cluster (Role-specific, hard, common)

Implement canary deployments with Istio traffic splitting (Role-specific, medium, common)