Notion

DevOps engineer at Notion

Notion runs a small-team, product-driven loop with an emphasis on craft and judgement. Engineering rounds often probe the block-based data model and the realtime collaboration architecture, and the bar for thoughtful product reasoning is high relative to the company's size.

Process timeline

Reported timeline: 2-4 weeks

1. Recruiter and manager: Background and product motivation.
2. Coding: Practical implementation with clean structure.
3. System design: Block-based data model and realtime collaboration.
4. Product and craft: Judgement about user-facing tradeoffs.

What Notion looks for

What they value

Reasoning about flexible, block-style data models
Product judgement on real tradeoffs
Clean, considered implementation

Culture signals

Craft and taste in a small, dense team
Caring about user-facing simplicity
Curiosity about the data model behind the product

Reported questions

Questions candidates report for this role at this company.

As asked

A new Deployment rollout is stuck at 3 out of 10 pods Ready. Walk me through how you would debug it in production.

Sample answer outline

Start with kubectl rollout status and kubectl describe deployment, then look at the failing pods' statuses and events: ImagePullBackOff, CrashLoopBackOff, and FailedScheduling each point in a different direction. Check pod events and pod logs for the failing replicas. Common causes: a readiness probe failing because the app needs longer to start, resource requests too high for the available nodes, a ConfigMap or Secret reference that does not exist, or a namespace ResourceQuota that stops new pods from being created. A PodDisruptionBudget does not limit a Deployment's own rolling update, so it only matters if a node drain is happening at the same time. Resist the urge to delete pods until you know the cause.

Reference implementation (bash)

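# Triage sequence: start at the Deployment, then narrow to individual pods
kubectl rollout status deployment/api -n prod                    # is the rollout progressing or stalled?
kubectl describe deployment/api -n prod                          # conditions, replica counts, recent events
kubectl get pods -n prod -l app=api                              # which pods are Pending, crashing, or not Ready
kubectl describe pod <pending-pod> -n prod                       # scheduling and image-pull events for one pod
kubectl logs <crashlooping-pod> -n prod --previous               # output from the previous, crashed container
kubectl get events -n prod --sort-by=.lastTimestamp | tail -20   # the 20 most recent namespace events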
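The point that different statuses "point in different directions" can be made concrete with a small lookup. This is a minimal Python sketch, not part of any Kubernetes tooling: the table name `NEXT_STEP` and the function `next_step` are hypothetical, and the suggested checks are a starting point rather than a complete runbook.

```python
# Hypothetical lookup: pod status or event reason -> first thing to check.
# The reasons are real Kubernetes strings; the advice is an illustrative summary.
NEXT_STEP = {
    "ImagePullBackOff": "Check the image name and tag, and the pod's imagePullSecrets.",
    "CrashLoopBackOff": "Read the last crash with kubectl logs --previous.",
    "FailedScheduling": "Compare resource requests, taints, and affinity with available nodes.",
    "Unhealthy": "Inspect the failing probe's path, port, and timing in the pod events.",
    "CreateContainerConfigError": "Verify every referenced ConfigMap and Secret exists.",
}


def next_step(reason: str) -> str:
    """Return the suggested first check for a pod status or event reason."""
    return NEXT_STEP.get(reason, "Run kubectl describe pod and read the events section.")


if __name__ == "__main__":
    for reason in ("FailedScheduling", "CrashLoopBackOff", "SomethingElse"):
        print(f"{reason}: {next_step(reason)}")
```

In an interview, saying the mapping out loud matters more than the code: it shows you branch on evidence instead of restarting pods and hoping.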

Expect these follow-ups

What if the readiness probe passes but the service is still returning 503s?
How do you set sane defaults for readiness and liveness probes?
When would you use a startup probe?

Tags: kubernetes, debugging, rollout

As asked

Tell me about the worst on-call shift you have personally had. What broke, what did you do in the moment, and what changed afterwards?

Sample answer outline

Pick a real story with measurable user impact. Cover detection (was it self-detected or customer-reported), the diagnosis path (especially the dead ends), the mitigation, and the postmortem actions that actually landed. Interviewers are listening for ownership without blame, calm under pressure, and process changes that prevent the next one. Avoid 'we just restarted it' answers; dig into the root cause.

Expect these follow-ups

What did the postmortem find as the contributing factors?
Did the action items actually ship, or did they slip?
What would you do differently in the first 10 minutes if it happened again?

Tags: incidents, on-call, postmortem

As asked

An engineer commits an AWS access key to a public GitHub repo. Walk me through what you do in the next hour.

Sample answer outline

Treat the key as compromised. Rotate immediately, do not wait. Revoke the old key, generate a new one, push it via the secret manager so dependent services pick it up. Check CloudTrail for any usage of the key in the window between commit and rotation, especially from unfamiliar IPs. Remove the key from history (git filter-repo or BFG) but understand that GitHub may have already cached it and bots scrape new commits within seconds, so rotation is what matters, not rewriting history. Talk to the engineer without blame, and add pre-commit hooks (gitleaks, trufflehog) so it does not happen again.
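To illustrate the kind of check a pre-commit hook performs, here is a minimal Python sketch that flags strings shaped like AWS access key IDs. It is a heuristic under stated assumptions, not how gitleaks or trufflehog are implemented: the pattern relies on the documented AKIA (long-term) and ASIA (temporary) prefixes followed by 16 uppercase alphanumeric characters, and the function name `scan_text` is hypothetical.

```python
import re

# Heuristic pattern: a 20-character access key ID with an AKIA or ASIA prefix.
# Real scanners use many more rules plus entropy checks; this catches only the ID.
ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b")


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line_number, key_id) pairs for every candidate key ID in text."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in ACCESS_KEY_ID.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = 'region = "eu-west-2"\naws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
    for line_number, key_id in scan_text(sample):
        print(f"line {line_number}: possible access key ID {key_id}")
```

A match is a reason to rotate, not proof of a leak, and a clean scan does not prove the absence of other secret types.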

Expect these follow-ups

Why is rewriting history less important than rotating?
How would you scan all current repos for past leaks?
What is your policy for repeat offenders?

Tags: secrets, incident-response, git

As asked

Tell me about the biggest incident you handled that lived in the platform layer: a broken deploy pipeline, an infrastructure change gone wrong, a certificate expiry, or a cluster failure. Walk me through detection, mitigation, and what changed in the platform afterwards.

Sample answer outline

Frame it through a platform lens, where the blast radius is every team that depends on you. Describe the impact across consumers, not just one service. Detection: what alerted you, and whether it was your monitoring or a downstream team that noticed first. Mitigation: the rollback or break-glass procedure, and whether it existed before the incident or had to be improvised. The strong answer ends with platform-level prevention: a guardrail in the pipeline, a pre-deploy check, an expiry alert, automated rollback. Interviewers listen for ownership of shared infrastructure and the discipline to turn one painful event into a control that protects every team.

Expect these follow-ups

Did a self-service guardrail exist, or did you have to build one after?
How did you communicate with the many teams affected at once?
What pipeline or infrastructure check would have caught this earlier?

Tags: incidents, platform, pipelines, ownership

As asked

Walk me through how you would design a blue/green deployment pipeline for a stateful API that owns a Postgres database and accepts long-lived WebSocket connections.

Sample answer outline

Stand up two identical environments behind a router (ALB, Envoy, or a service mesh). Migrate the schema in expand/contract phases so blue and green can both read and write during the overlap. Drain WebSockets gracefully: stop accepting new connections on blue, let clients reconnect to green via DNS or sticky-session routing. Gate the full cutover behind a small canary slice (1 to 5 percent) and watch SLO burn, error rates, and connection re-establishment metrics. Keep blue warm for the rollback window, then tear it down.
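The canary step can be sketched as a small decision function. This is an illustrative Python example, not a description of any particular deployment tool: the `CanaryWindow` type, the `canary_decision` function, and every threshold are hypothetical values you would replace with numbers derived from your own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryWindow:
    """Metrics observed on green during one evaluation window (hypothetical shape)."""
    requests: int
    errors: int
    reconnect_failures: int


def canary_decision(
    window: CanaryWindow,
    max_error_rate: float = 0.01,       # assumed threshold, not a recommendation
    max_reconnect_failures: int = 10,   # assumed threshold for WebSocket reconnects
    min_requests: int = 500,            # do not decide on too little traffic
) -> str:
    """Return 'hold', 'rollback', or 'promote' for the canary slice."""
    if window.requests < min_requests:
        return "hold"
    error_rate = window.errors / window.requests
    if error_rate > max_error_rate or window.reconnect_failures > max_reconnect_failures:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    print(canary_decision(CanaryWindow(requests=1000, errors=5, reconnect_failures=2)))
```

Keeping "hold" as a distinct outcome avoids promoting or rolling back on a window with too little traffic to be meaningful.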

Expect these follow-ups

How does this change if the schema migration is destructive and not backwards compatible?
What if you cannot afford to double-provision the database?
How would you handle a partial cutover failure where 30 percent of traffic is on green?

Tags: deployments, stateful, cutover

As asked

Write Terraform that provisions a VPC across three availability zones with public and private subnets and a NAT gateway per AZ. Explain the cost tradeoffs.

Sample answer outline

Use a module that takes a CIDR and a count of AZs and emits a public/private subnet pair per AZ. One NAT gateway per AZ keeps an egress failure contained to its own AZ but triples the fixed hourly NAT charge compared with a single gateway. A single shared NAT is cheaper, but an outage in the AZ that hosts it takes down outbound traffic for every private subnet, and traffic from the other AZs pays cross-AZ data transfer to reach it. Tag everything with owner and cost-centre so the bill is debuggable. Use remote state with locking, plan in CI, and apply manually for shared environments.

Reference implementation (hcl)

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name = "prod"
  cidr = "10.0.0.0/16"

  azs             = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
  public_subnets  = ["10.0.101.0/24", "10.0.102.0/24", "10.0.103.0/24"]

  enable_nat_gateway     = true
  single_nat_gateway     = false
  one_nat_gateway_per_az = true

  tags = { owner = "platform", env = "prod" }
}
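The subnet layout and the fixed-cost tradeoff can be checked with a few lines of Python. This is a sketch under stated assumptions: `plan_subnets` and `nat_fixed_monthly_cost` are hypothetical helpers, the offset of 100 between private and public blocks simply mirrors the CIDRs in the module call above, and `hourly_rate` is a placeholder you must fill in from current AWS pricing for your region.

```python
import ipaddress


def plan_subnets(vpc_cidr, az_names, new_prefix=24, public_offset=100):
    """Carve one private and one public subnet per AZ out of the VPC CIDR."""
    blocks = list(ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix))
    return {
        az: {"private": str(blocks[i]), "public": str(blocks[public_offset + i])}
        for i, az in enumerate(az_names, start=1)
    }


def nat_fixed_monthly_cost(gateway_count, hourly_rate, hours_per_month=730):
    """Fixed hourly NAT charge only; per-GB processing is billed separately."""
    return gateway_count * hourly_rate * hours_per_month


if __name__ == "__main__":
    azs = ["eu-west-2a", "eu-west-2b", "eu-west-2c"]
    for az, pair in plan_subnets("10.0.0.0/16", azs).items():
        print(az, pair)
    rate = 0.05  # placeholder rate for illustration, not a quoted AWS price
    print("one per AZ:", nat_fixed_monthly_cost(len(azs), rate))
    print("single NAT:", nat_fixed_monthly_cost(1, rate))
```

Whatever the real rate is, the fixed charge scales linearly with the number of gateways, which is the tradeoff the interviewer wants named.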

Expect these follow-ups

How would you swap the NAT gateway for a NAT instance and when is that worth it?
Where do VPC endpoints fit into the cost story?
How do you handle drift when someone edits the console by hand?

Tags: terraform, aws, networking

How the Notion loop applies to DevOps engineer candidates

Notion is a late-stage unicorn headquartered in San Francisco, and the same four-stage process described above is what a DevOps engineer candidate walks through, with the technical stages tuned to the infrastructure discipline.

For a DevOps engineer, the load concentrates on coding and system design. Those are the stages where the infrastructure signal is read most closely, so they are where preparation pays off most. The non-technical stages (recruiter and manager, and product and craft) still gate the offer, but they assess fit and communication rather than role-specific depth.

What the DevOps engineer question mix signals

The 6 most-reported DevOps engineer questions cluster around role-specific (3), behavioural (2), and system design (1). That distribution is the clearest read on what Notion actually probes for this role: the more a topic recurs, the more reliably it shows up in the loop, so it is worth weighting practice the same way.

The set spans easy, medium, and hard difficulty. Because the topics are concentrated rather than scattered, depth in the leading area matters more than breadth for this particular role.

What moves a DevOps engineer offer forward at Notion

Across the loop, the traits that consistently move a Notion DevOps engineer offer forward are reasoning about flexible, block-style data models, product judgement on real tradeoffs, and clean, considered implementation. These are not abstract values; interviewers score against them, so a DevOps engineer who demonstrates them explicitly (naming the tradeoff, stating the assumption, checking the edge case out loud) reads stronger than one who only reaches the right answer silently.

The behavioural and culture stages are checking for craft and taste in a small, dense team, caring about user-facing simplicity, and curiosity about the data model behind the product. For a DevOps engineer, the most credible way to show these is through specific, recent examples from real infrastructure work rather than rehearsed generalities.

How to read the DevOps engineer salary band

The salary signal shown for this role is the approximate senior median of $291,000 in San Francisco, reported as total compensation including bonus and equity and sourced from BLS, ONS, and Levels.fyi reference data. It is a market band for the DevOps engineer role and city, not a Notion offer.

San Francisco carries a cost-of-living index of 112 on the scale where New York City equals 100, so read the headline figure alongside that index when comparing it with another market. Individual pay at Notion varies by level, team, equity refresh, and negotiation, which the open salary breakdown for this role lays out city by city.

Salary band

Senior p50: $291,000
City: San Francisco
Data type: Total comp

Reported questions at a glance

Debug a Kubernetes rollout that is stuck (Role-specific, medium, very common)
Tell me about the worst on-call shift you have had (Behavioural, easy, very common)
Remediate secrets committed to source control (Role-specific, easy, very common)
Tell me about an incident you handled on the platform or pipeline layer (Behavioural, medium, very common)
Design a blue/green deployment pipeline for a stateful service (System design, hard, common)
Write Terraform for a multi-AZ VPC with private subnets (Role-specific, medium, common)
