Notion runs a small-team, product-driven loop with an emphasis on craft and judgement. Engineering rounds often probe the block-based data model and the realtime collaboration architecture, and the bar for thoughtful product reasoning is high relative to the company's size.
Process timeline
Reported timeline: 2-4 weeks
1. Recruiter and manager: Background and product motivation.
2. Coding: Practical implementation with clean structure.
3. System design: Block-based data model and realtime collaboration.
4. Product and craft: Judgement about user-facing tradeoffs.
What Notion looks for
What they value
Reasoning about flexible, block-style data models
Product judgement on real tradeoffs
Clean, considered implementation
Culture signals
Craft and taste in a small, dense team
Caring about user-facing simplicity
Curiosity about the data model behind the product
Reported questions
Questions candidates report for this role at this company.
As asked
A new Deployment rollout is stuck at 3 out of 10 pods Ready. Walk me through how you would debug it in production.
Sample answer outline
Start with kubectl rollout status and kubectl describe deployment. Look at the events: ImagePullBackOff, CrashLoopBackOff, and FailedScheduling all point in different directions. Check pod events and pod logs for the failing replicas. Common causes: a readiness probe failing because the app needs longer to start, resource requests too high for the available nodes, a ConfigMap or Secret reference that does not exist, or a ResourceQuota that stops new pods from being created. A PodDisruptionBudget is rarely the culprit here, because it limits voluntary evictions such as node drains rather than the Deployment's own rolling update. Resist the urge to delete pods until you know the cause.
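The triage order in this outline can be sketched as a small Python helper. It is an illustration only: it assumes the pod objects come from `kubectl get pods -o json`, and the `NEXT_STEP` table and `triage_pod` function are hypothetical names, not part of any Kubernetes client.

```python
from typing import Any

# Hypothetical mapping from a container waiting reason to the next check.
NEXT_STEP = {
    "ImagePullBackOff": "check image name/tag and registry credentials",
    "ErrImagePull": "check image name/tag and registry credentials",
    "CrashLoopBackOff": "read logs of the previous container run",
    "CreateContainerConfigError": "check referenced ConfigMaps and Secrets exist",
}


def triage_pod(pod: dict[str, Any]) -> str:
    """Return a hint for one pod object taken from `kubectl get pods -o json`."""
    status = pod.get("status", {})

    # A pod that could not be placed on a node reports PodScheduled=False.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return "unschedulable: compare resource requests with node capacity"

    for cs in status.get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting")
        if waiting:
            reason = waiting.get("reason", "")
            return NEXT_STEP.get(reason, f"waiting ({reason}): describe the pod")
        if not cs.get("ready", False):
            # Container is running but the readiness probe has not passed.
            return "running but not Ready: inspect the readiness probe"

    return "ready"
```

Running it over the `items` list of the JSON output groups the seven unready pods by likely cause before anything is deleted.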
Expect these follow-ups
What if the readiness probe passes but the service is still returning 503s?
How do you set sane defaults for readiness and liveness probes?
When would you use a startup probe?
Tags: kubernetes, debugging, rollout
As asked
Tell me about the worst on-call shift you have personally had. What broke, what did you do in the moment, and what changed afterwards?
Sample answer outline
Pick a real story with measurable user impact. Cover detection (was it self-detected or customer-reported), the diagnosis path (especially the dead ends), the mitigation, and the postmortem actions that actually landed. Interviewers are listening for ownership without blame, calm under pressure, and process changes that prevent the next one. Avoid "we just restarted it" answers; dig into the root cause.
Expect these follow-ups
What did the postmortem find as the contributing factors?
Did the action items actually ship, or did they slip?
What would you do differently in the first 10 minutes if it happened again?
Tags: incidents, on-call, postmortem
As asked
An engineer commits an AWS access key to a public GitHub repo. Walk me through what you do in the next hour.
Sample answer outline
Treat the key as compromised. Rotate immediately, do not wait. Revoke the old key, generate a new one, push it via the secret manager so dependent services pick it up. Check CloudTrail for any usage of the key in the window between commit and rotation, especially from unfamiliar IPs. Remove the key from history (git filter-repo or BFG) but understand that GitHub may have already cached it and bots scrape new commits within seconds, so rotation is what matters, not rewriting history. Talk to the engineer without blame, and add pre-commit hooks (gitleaks, trufflehog) so it does not happen again.
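As a rough illustration of what secret scanners look for, the Python sketch below flags strings shaped like AWS access key IDs. It assumes the documented 20-character format with an `AKIA` (long-term) or `ASIA` (temporary) prefix; real tools such as gitleaks and trufflehog apply many more rules, and `find_access_key_ids` is a hypothetical helper, not part of either tool.

```python
import re

# Access key IDs are 20 characters: a 4-letter prefix plus 16 uppercase
# alphanumerics. This pattern does not detect the secret access key itself.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def find_access_key_ids(text: str) -> list[tuple[int, str]]:
    """Return (line number, match) pairs for anything shaped like a key ID."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = 'region = "us-east-1"\naws_access_key_id = AKIAIOSFODNN7EXAMPLE\n'
    for lineno, key_id in find_access_key_ids(sample):
        print(f"line {lineno}: possible access key ID {key_id}")
```

A hit from a check like this only tells you where to look; the key still has to be rotated, because the scan cannot tell whether it was already scraped.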
Expect these follow-ups
Why is rewriting history less important than rotating?
How would you scan all current repos for past leaks?
What is your policy for repeat offenders?
Tags: secrets, incident-response, git
As asked
Tell me about the biggest incident you handled that lived in the platform layer: a broken deploy pipeline, an infrastructure change gone wrong, a certificate expiry, or a cluster failure. Walk me through detection, mitigation, and what changed in the platform afterwards.
Sample answer outline
Frame it through a platform lens, where the blast radius is every team that depends on you. Describe the impact across consumers, not just one service. Detection: what alerted you, and whether it was your monitoring or a downstream team that noticed first. Mitigation: the rollback or break-glass procedure, and whether it existed before the incident or had to be improvised. The strong answer ends with platform-level prevention: a guardrail in the pipeline, a pre-deploy check, an expiry alert, automated rollback. Interviewers listen for ownership of shared infrastructure and the discipline to turn one painful event into a control that protects every team.
Expect these follow-ups
Did a self-service guardrail exist, or did you have to build one after?
How did you communicate with the many teams affected at once?
What pipeline or infrastructure check would have caught this earlier?
Tags: incidents, platform, pipelines, ownership
As asked
Walk me through how you would design a blue/green deployment pipeline for a stateful API that owns a Postgres database and accepts long-lived WebSocket connections.
Sample answer outline
Stand up two identical environments behind a router (ALB, Envoy, or a service mesh). Migrate the schema in expand/contract phases so blue and green can both read and write against the same database during the overlap. Drain WebSockets gracefully: stop accepting new connections on blue, and let clients reconnect to green via DNS or sticky-session routing. Gate the cutover with a small canary slice (1 to 5 percent) and watch SLO burn, error rates, and connection re-establishment metrics. Keep blue warm for the rollback window, then tear it down.
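The canary gate in this outline can be expressed as a small decision function. The Python sketch below uses the common definition of burn rate (observed error rate divided by the error budget, 1 minus the SLO); the function names, the 99.9 percent SLO, and the hold/rollback thresholds are assumptions chosen for illustration, not values from the question.

```python
def burn_rate(errors: int, requests: int, slo: float) -> float:
    """Error-budget burn rate: observed error rate / allowed error rate."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if requests <= 0:
        return 0.0  # no canary traffic yet, so nothing has been burned
    return (errors / requests) / (1.0 - slo)


def cutover_decision(
    errors: int,
    requests: int,
    slo: float = 0.999,            # assumed availability target
    hold_above: float = 1.0,       # assumed: burning faster than budget
    rollback_above: float = 10.0,  # assumed: burning far faster than budget
) -> str:
    """Decide what to do with the canary slice on green."""
    rate = burn_rate(errors, requests, slo)
    if rate > rollback_above:
        return "rollback"
    if rate > hold_above:
        return "hold"
    return "promote"
```

In practice the same gate would also consider connection re-establishment metrics, since WebSocket clients that fail to reconnect may not show up as HTTP errors.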
Expect these follow-ups
How does this change if the schema migration is destructive and not backwards compatible?
What if you cannot afford to double-provision the database?
How would you handle a partial cutover failure where 30 percent of traffic is on green?
Tags: deployments, stateful, cutover
As asked
Write Terraform that provisions a VPC across three availability zones with public and private subnets and a NAT gateway per AZ. Explain the cost tradeoffs.
Sample answer outline
Use a module that takes a CIDR and a count of AZs and emits subnet pairs per AZ. One NAT per AZ keeps an egress failure isolated to that AZ but triples the hourly NAT gateway charge; data-processing charges scale with traffic rather than with gateway count. A single shared NAT is cheaper but means an outage in the NAT's AZ takes down outbound traffic for everyone, and traffic from the other AZs pays cross-AZ data transfer to reach it. Tag everything with owner and cost-centre so the bill is debuggable. Use remote state with locking, plan in CI, and apply manually for shared environments.
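The subnet arithmetic such a module performs can be checked outside Terraform. The Python sketch below carves a VPC CIDR into one public and one private block per AZ using the standard `ipaddress` module; the `/20` subnet size, the AZ names, and the `plan_subnets` helper are assumptions for illustration, not a prescribed layout.

```python
import ipaddress
from itertools import islice


def plan_subnets(
    vpc_cidr: str, azs: list[str], new_prefix: int = 20
) -> dict[str, dict[str, str]]:
    """Assign one public and one private subnet per AZ from the VPC CIDR."""
    network = ipaddress.ip_network(vpc_cidr)
    needed = 2 * len(azs)
    # subnets() yields equal-sized blocks in address order.
    blocks = list(islice(network.subnets(new_prefix=new_prefix), needed))
    if len(blocks) < needed:
        raise ValueError("VPC CIDR is too small for the requested layout")

    plan = {}
    for i, az in enumerate(azs):
        plan[az] = {
            "public": str(blocks[i]),               # first len(azs) blocks
            "private": str(blocks[len(azs) + i]),   # next len(azs) blocks
        }
    return plan


if __name__ == "__main__":
    # Hypothetical AZ names; with one NAT gateway per AZ the gateway count
    # equals len(azs), versus 1 for a single shared NAT.
    for az, subnets in plan_subnets(
        "10.0.0.0/16", ["us-east-1a", "us-east-1b", "us-east-1c"]
    ).items():
        print(az, subnets)
```

Keeping the public blocks contiguous and the private blocks contiguous is one design choice among several; interleaving per AZ works equally well as long as the blocks do not overlap.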
Expect these follow-ups
How would you swap the NAT gateway for a NAT instance and when is that worth it?
Where do VPC endpoints fit into the cost story?
How do you handle drift when someone edits the console by hand?
Tags: terraform, aws, networking
DevOps engineer interview detail at Notion
How the Notion loop applies to DevOps engineer candidates
Notion is a late-stage unicorn headquartered in San Francisco, and the same 4-stage process described above is what a DevOps engineer candidate walks through, with the technical stages tuned to the infrastructure discipline.
For a DevOps engineer, the load concentrates on coding and system design. Those are the stages where the infrastructure signal is read most closely, so they are where preparation pays off most. The two non-technical stages (recruiter and manager, and product and craft) still gate the offer, but they assess fit and communication rather than role-specific depth.
What the DevOps engineer question mix signals
The 6 most-reported DevOps engineer questions cluster around role-specific (3), behavioural (2), and system design (1). That distribution is the clearest read on what Notion actually probes for this role: the more a topic recurs, the more reliably it shows up in the loop, so it is worth weighting practice the same way.
The set spans easy, medium, and hard difficulty, topping out at hard problems. Because the topics are concentrated rather than scattered, depth in the leading area matters more than breadth for this particular role.
What moves a DevOps engineer offer forward at Notion
Across the loop, the traits that consistently move a Notion DevOps engineer offer forward are reasoning about flexible, block-style data models, product judgement on real tradeoffs, and clean, considered implementation. These are not abstract values; interviewers score against them, so a DevOps engineer who demonstrates them explicitly (naming the tradeoff, stating the assumption, checking the edge case out loud) reads stronger than one who only reaches the right answer silently.
The behavioural and culture stages are checking for craft and taste in a small, dense team, caring about user-facing simplicity, and curiosity about the data model behind the product. For a DevOps engineer, the most credible way to show these is through specific, recent examples from real infrastructure work rather than rehearsed generalities.
How to read the DevOps engineer salary band
The salary signal shown for this role is the approximate senior median of $291,000 in San Francisco, reported as total compensation including bonus and equity and sourced from BLS, ONS, and Levels.fyi reference data. It is a market band for the DevOps engineer role and city, not a Notion offer.
San Francisco carries a cost-of-living index of 112 on the scale where New York City equals 100, so read the headline figure alongside that index when comparing it with another market. Individual pay at Notion varies by level, team, equity refresh, and negotiation, which the open salary breakdown for this role lays out city by city.