As asked
A highly anticipated Netflix original releases at midnight UTC. You expect 10 to 20 times normal play traffic in the first 30 minutes. Walk me through how the platform team prepares for this, what failure modes are most likely, and what your runbook looks like for the on-call engineer.
Sample answer outline
A strong answer covers five things: pre-seeding Open Connect appliances with the encoded assets the day before release; load testing the API gateway and recommendation service under simulated 20x traffic; using feature flags to shed non-critical work (recommendations, social features) so that degraded services return cached or default responses; pre-scaling autoscaling groups to the expected request curve rather than relying on reactive scaling; and a runbook for the on-call engineer. The runbook defines the escalation path, the synthetic transaction monitors that alert within 30 seconds of a play failure, and the rollback plan if a deploy turns out to be the cause.
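Two of the steps above, flag-driven load shedding and pre-scaling, can be sketched in a few lines of Python. This is an illustrative sketch only, not Netflix's implementation: `SHED_FLAGS`, `call_with_fallback`, `prescale_target`, and the 25% headroom default are all hypothetical names and values chosen for the example.

```python
import math
from typing import Callable, Dict, TypeVar

T = TypeVar("T")

# Hypothetical in-process flag store. A real platform would read these
# flags from a dynamic configuration service so on-call can flip them live.
SHED_FLAGS: Dict[str, bool] = {"recommendations": False, "social": False}


def call_with_fallback(feature: str, primary: Callable[[], T], fallback: T) -> T:
    """Return `fallback` when the feature is shed or the primary call fails."""
    if SHED_FLAGS.get(feature, False):
        # Flag is on: skip the dependency entirely to save capacity.
        return fallback
    try:
        return primary()
    except Exception:
        # Degrade rather than fail the page; playback should not depend
        # on a non-critical service being healthy.
        return fallback


def prescale_target(baseline_instances: int, multiplier: float,
                    headroom: float = 0.25) -> int:
    """Instances to provision before launch: baseline times the expected
    traffic multiplier, plus a safety margin (25% here is an assumption)."""
    return math.ceil(baseline_instances * multiplier * (1 + headroom))
```

In an interview, the point to draw out is that the fallback path is exercised both deliberately (the flag) and automatically (the exception handler), and that the pre-scale target is sized to the top of the expected range, for example `prescale_target(40, 20)` for a 40-instance baseline at 20x.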
Expect these follow-ups
- The play request volume is normal but users are reporting the title is not appearing on their homepage. What do you investigate first?
- How do you communicate status to a global engineering team during an incident at 2am when half the team is asleep?