Step 1 — Clarify the requirements
Never start by drawing boxes. A strong candidate spends the first few minutes scoping the problem so the design that follows is justified. For a rate limiter, the questions worth asking are:
- Are we limiting per user, per IP, per API key, or per endpoint?
- Is the limiter a separate service, a middleware/library, or part of an API gateway?
- What should happen on a limit breach: reject with 429, queue, or throttle?
- Does the limit need to be exact, or is a small overshoot acceptable?
Functional requirements
- Allow or reject each request based on a configurable limit and window.
- Return clear feedback on rejection (HTTP 429 plus Retry-After / rate-limit headers).
- Support different limits per client tier and per endpoint.
Non-functional requirements
- Low latency: the check is on the hot path of every request, so it must add minimal overhead.
- Accurate enough under concurrency without becoming a bottleneck itself.
- Fault-tolerant: if the limiter store is down, fail open or closed by deliberate policy.
Step 2 — Back-of-the-envelope estimates
Sizing the system tells you which parts are hard. Round aggressively and state your assumptions out loud; the numbers matter less than showing you can reason about scale.
| Metric | Estimate | Reasoning |
|---|---|---|
| Per-request overhead budget | < 1-2 ms | The limiter sits in front of real work; anything slower hurts every request. |
| Counter operations | 1 read-modify-write per request | Each request increments and checks a counter, so the store must handle peak RPS. |
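As a rough illustration of how such an estimate might be worked through, the sketch below sizes peak counter operations and counter memory. Every input (client count, average RPS, peak factor, bytes per counter) is a made-up assumption chosen for round numbers, not a figure from a real system.

```python
# All inputs are illustrative assumptions, not measurements.
ACTIVE_CLIENTS = 1_000_000   # clients with a live counter in the current window
AVG_RPS = 50_000             # average requests per second across the fleet
PEAK_FACTOR = 3              # assumed peak-to-average ratio
BYTES_PER_COUNTER = 100      # key + value + store overhead, a rough guess


def peak_counter_ops(avg_rps: int, peak_factor: int) -> int:
    """One read-modify-write per request, so store ops track peak RPS."""
    return avg_rps * peak_factor


def counter_memory_mb(clients: int, bytes_per_counter: int) -> float:
    """Memory for one live counter per active client, in MB (10**6 bytes)."""
    return clients * bytes_per_counter / 1_000_000


if __name__ == "__main__":
    print(peak_counter_ops(AVG_RPS, PEAK_FACTOR))                # 150000
    print(counter_memory_mb(ACTIVE_CLIENTS, BYTES_PER_COUNTER))  # 100.0
```

Under these assumptions the memory footprint is small and the operation rate is the harder constraint, which is the kind of conclusion the estimate is meant to surface.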
Step 3 — Data model and API
A compact data model and a small API surface anchor the rest of the discussion. Keep both minimal; you can always extend them when the interviewer pushes.
Core entities
- counter (Redis): key = {client_id}:{window}, value = count, TTL = window length. Atomic INCR with expiry implements a fixed window cheaply.
- rules: client_tier, endpoint, limit, window_seconds, algorithm. Config store the limiter loads; cache it locally and refresh.
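The counter entity can be sketched in memory to show how the key and TTL work together. This is a minimal single-process stand-in for the Redis counter, not a Redis client; the class name `FixedWindowCounter` and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowCounter:
    """In-memory stand-in for the Redis counter; not shared across processes."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        # Maps key -> (count, expiry timestamp), mimicking value + TTL.
        self._store: Dict[str, Tuple[int, float]] = {}

    def _key(self, client_id: str, now: float) -> str:
        # key = {client_id}:{window}, where window is the window index.
        window = int(now // self.window_seconds)
        return f"{client_id}:{window}"

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        key = self._key(client_id, now)
        count, expires_at = self._store.get(key, (0, now + self.window_seconds))
        count += 1  # the INCR step
        self._store[key] = (count, expires_at)
        # Drop expired keys, as a TTL would. A linear scan per call is fine
        # for a sketch but would be too slow for a real store.
        for stale in [k for k, (_, exp) in self._store.items() if exp <= now]:
            del self._store[stale]
        return count <= self.limit
```

A new window index produces a new key, so the count resets on rollover without any explicit reset step.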
API sketch
- POST /check: Internal; returns allow/deny for a request key (used by gateway/middleware).
- GET /limits/{client_id}: Inspect remaining quota for a client.
Step 4 — High-level design
Sketch the happy path end to end before optimising anything. This is the architecture you would draw on the whiteboard first:
1. Place the limiter at the API gateway or as middleware so it runs before business logic.
2. On each request, derive a key (client + window) and atomically increment a counter in a shared fast store.
3. Compare the count to the configured limit; allow or return 429 with rate-limit headers.
4. Centralise counters in Redis so all stateless app instances see the same totals.
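The steps above can be sketched as a small middleware wrapper. This is a simplified sketch under several assumptions: the store is a local dict rather than Redis, the `rate_limited` function and its response tuple are hypothetical, and the `X-RateLimit-*` header names follow a common convention rather than a formal standard (`Retry-After` is a standard HTTP header).

```python
import math
import time
from typing import Callable, Dict, Tuple

Response = Tuple[int, Dict[str, str], str]  # (status, headers, body)


def rate_limited(handler: Callable[[str], str], limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time) -> Callable[[str], Response]:
    """Wrap a handler so the limit check runs before business logic."""
    # Stand-in for the shared store; old windows are never purged here.
    counts: Dict[Tuple[str, int], int] = {}

    def wrapped(client_id: str) -> Response:
        now = clock()
        window = int(now // window_seconds)
        key = (client_id, window)              # derive key: client + window
        counts[key] = counts.get(key, 0) + 1   # increment the counter
        headers = {
            "X-RateLimit-Limit": str(limit),
            "X-RateLimit-Remaining": str(max(0, limit - counts[key])),
        }
        if counts[key] > limit:
            # Seconds until the current window rolls over.
            headers["Retry-After"] = str(math.ceil((window + 1) * window_seconds - now))
            return 429, headers, "Too Many Requests"
        return 200, headers, handler(client_id)

    return wrapped
```

In a real deployment the dict would be replaced by the shared store so that every app instance sees the same totals.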
Step 5 — Deep dives that separate strong answers
The high-level design is table stakes. Interviewers spend most of the time here, probing the decisions that actually carry the system. These are the ones to be ready for.
Choosing the algorithm
- Fixed window: one counter per window, reset on rollover. Cheap, but allows up to 2x the limit across a window boundary (a burst at 0:59 and another at 1:00).
- Sliding window log: store a timestamp per request and count those in the trailing window. Accurate, but memory grows with traffic.
- Sliding window counter: weight the previous window's count by the overlap fraction. A good accuracy/cost compromise and the common production choice.
- Token bucket: tokens refill at a fixed rate up to a capacity; each request consumes one. Allows controlled bursts up to the bucket size and is intuitive to explain.
- Leaky bucket: requests enter a fixed-size queue drained at a constant rate, smoothing output to a steady stream but adding latency.
Name two and contrast them; do not just pick one silently.
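To make one of these concrete, here is a minimal single-process sketch of a token bucket. The class name, the `cost` parameter, and the injectable `clock` are assumptions for illustration; a distributed version would keep the token count and last-refill time in the shared store.

```python
import time
from typing import Callable


class TokenBucket:
    """Tokens refill continuously at refill_rate, capped at capacity."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self.clock = clock
        self.tokens = capacity          # start full, so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.last_refill
        # Refill lazily on each call instead of running a background timer.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The capacity bounds the burst size and the refill rate bounds the sustained rate, which is why the two can be tuned independently.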
Distributed consistency
With many app servers, per-instance counters undercount the global rate, so use a shared store like Redis. The increment and check must be atomic: with a separate read followed by a write, two concurrent requests can read the same count and both pass. Redis's INCR is atomic and returns the new value, and running INCR and EXPIRE together in a Lua script keeps the TTL from being lost if the client dies between the two commands. To shave latency you can let each node enforce a local approximate limit and sync to the central counter periodically, accepting slight overshoot for speed. Mention clock skew: token-bucket refills computed from server time must use a single source or tolerate drift.
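A minimal sketch of the atomic increment follows, assuming a client object that exposes `eval(script, numkeys, *keys_and_args)` in the style of redis-py. The `ScriptClient` protocol and the `allow` helper are hypothetical names, and the Lua script is the familiar increment-then-set-expiry-on-first-hit pattern rather than a complete limiter.

```python
from typing import Protocol

# Redis runs a Lua script atomically, so no other client can interleave
# between the INCR and the EXPIRE.
INCR_WITH_TTL = """
local current = redis.call('INCR', KEYS[1])
if current == 1 then
    redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
"""


class ScriptClient(Protocol):
    """Anything with a redis-py style eval method (assumed interface)."""

    def eval(self, script: str, numkeys: int, *keys_and_args): ...


def allow(client: ScriptClient, key: str, limit: int, window_seconds: int) -> bool:
    # One round-trip: increment, set the TTL on first use, return the count.
    count = client.eval(INCR_WITH_TTL, 1, key, window_seconds)
    return int(count) <= limit
```

Because the script returns the post-increment count, the caller never reads a stale value before deciding.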
Failure policy
Decide up front whether to fail open (allow traffic if the limiter store is unreachable, prioritising availability) or fail closed (reject, prioritising protection). Most public APIs fail open for a brief outage to avoid turning a cache blip into a full outage, but a payment or auth endpoint may fail closed. State the choice and the reasoning.
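The policy can be made explicit in code so it is a deliberate setting rather than an accident of exception handling. In this sketch, `StoreUnavailable` and `check_with_policy` are hypothetical names; a real client library would raise its own connection or timeout errors.

```python
from typing import Callable


class StoreUnavailable(Exception):
    """Hypothetical error raised when the limiter store cannot be reached."""


def check_with_policy(check: Callable[[str], bool], key: str, fail_open: bool) -> bool:
    """Run the limit check; on a store outage, fall back to the chosen policy."""
    try:
        return check(key)
    except StoreUnavailable:
        # Fail open: allow the request. Fail closed: reject it.
        return fail_open
```

A public read endpoint might be wired with `fail_open=True`, while a payment or auth endpoint might use `fail_open=False`.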
Step 6 — Bottlenecks and how to scale past them
Naming where the design breaks, and the specific fix, is what signals seniority. For a rate limiter the pressure points are:
- Bottleneck: Central counter store becomes the hot path for all traffic. Fix: Shard counters by key, use atomic ops, and add local approximate limiting to reduce round-trips.
- Bottleneck: Boundary bursts with fixed windows. Fix: Switch to sliding-window counter or token bucket.
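As an illustration of the sliding-window counter fix, the sketch below weights the previous window's count by how much of it still overlaps the trailing window. It tracks a single client in memory, and the class name and `clock` parameter are assumptions; rejected requests are not counted.

```python
import time
from typing import Callable, Dict


class SlidingWindowCounter:
    """Approximates a trailing window from two fixed-window counts."""

    def __init__(self, limit: int, window_seconds: int,
                 clock: Callable[[], float] = time.time) -> None:
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self.counts: Dict[int, int] = {}  # window index -> accepted requests

    def allow(self) -> bool:
        now = self.clock()
        window = int(now // self.window_seconds)
        # Fraction of the current window that has already elapsed.
        elapsed = (now % self.window_seconds) / self.window_seconds
        previous = self.counts.get(window - 1, 0)
        current = self.counts.get(window, 0)
        # The previous window still overlaps the trailing window by (1 - elapsed).
        estimated = previous * (1.0 - elapsed) + current
        if estimated + 1 > self.limit:
            return False
        self.counts[window] = current + 1
        # Only the current and previous windows are ever needed.
        self.counts = {w: c for w, c in self.counts.items() if w >= window - 1}
        return True
```

With a limit of 10 per 60 seconds, a full burst at 0:59 leaves the estimate at 10 when the window rolls over at 1:00, so the second burst that a fixed window would admit is rejected.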
Step 7 — Key tradeoffs
There is rarely one right answer. State the tradeoff, then commit to a side with a reason tied to the requirements you clarified in step one.
- Algorithm: token bucket (allows bursts) vs leaky bucket (smooths to a constant rate). Guidance: Token bucket when bursts are acceptable and common; leaky bucket when downstream needs a steady rate.
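For contrast with the token bucket, here is a minimal sketch of the queue-based leaky bucket. The names `LeakyBucket`, `offer`, and `drain` are assumptions, and the sketch assumes a worker calls `drain()` on a regular tick; if it is not called for a long time, the next call releases the backlog in one batch.

```python
import time
from collections import deque
from typing import Callable, Deque, List


class LeakyBucket:
    """Fixed-size queue whose contents are released at a constant rate."""

    def __init__(self, capacity: int, drain_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.drain_rate = drain_rate  # requests released per second
        self.clock = clock
        self.queue: Deque[str] = deque()
        self.last_drain = clock()

    def offer(self, request: str) -> bool:
        """Enqueue the request, or reject it if the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        self.queue.append(request)
        return True

    def drain(self) -> List[str]:
        """Release the requests that are due since the last drain."""
        now = self.clock()
        due = int((now - self.last_drain) * self.drain_rate)
        if due <= 0:
            return []
        # Advance by whole drain slots so fractional time carries over.
        self.last_drain += due / self.drain_rate
        released: List[str] = []
        while self.queue and len(released) < due:
            released.append(self.queue.popleft())
        return released
```

Accepted requests wait in the queue until their slot comes up, which is where the added latency comes from.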
- Accuracy vs latency: strict central counter (accurate, slower) vs local approximate + periodic sync (fast, slight overshoot). Guidance: Accept overshoot for high-RPS public APIs; stay strict where exactness is a hard requirement.
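The overshoot from local approximate limiting can be seen in a small simulation. This is a sketch with hypothetical names (`CentralCounter`, `LocalLimiter`); it syncs after a fixed number of local requests rather than on a timer so the behaviour is deterministic, and it models a single window with no reset.

```python
class CentralCounter:
    """Stand-in for the shared store holding the global total."""

    def __init__(self) -> None:
        self.total = 0

    def add(self, n: int) -> int:
        self.total += n
        return self.total


class LocalLimiter:
    """Counts locally and pushes to the central counter every sync_every requests."""

    def __init__(self, central: CentralCounter, limit: int, sync_every: int) -> None:
        self.central = central
        self.limit = limit
        self.sync_every = sync_every
        self.pending = 0      # local increments not yet pushed
        self.known_total = 0  # global total as of the last sync

    def allow(self) -> bool:
        # Decide from possibly stale data: no round-trip on the hot path.
        if self.known_total + self.pending >= self.limit:
            return False
        self.pending += 1
        if self.pending >= self.sync_every:
            self.sync()
        return True

    def sync(self) -> None:
        self.known_total = self.central.add(self.pending)
        self.pending = 0


if __name__ == "__main__":
    central = CentralCounter()
    a = LocalLimiter(central, limit=10, sync_every=5)
    b = LocalLimiter(central, limit=10, sync_every=5)
    allowed = sum(a.allow() for _ in range(5)) + sum(b.allow() for _ in range(5))
    allowed += sum(a.allow() for _ in range(10))  # node a has not seen b's traffic
    print(allowed)  # 15, against a limit of 10
```

In this sketch the overshoot comes from node `a` acting on a stale total until its next sync, so a smaller sync interval trades more round-trips for a tighter bound.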
Common follow-up questions
When you finish the core design, expect the interviewer to pull on one of these threads. Have a one-paragraph answer ready for each.
- How do you communicate limits back to clients (headers, Retry-After)?
- How would you support per-endpoint and per-tier limits together?
- How does the design change if the limiter must be exact?
- How do you rate-limit by a combination of keys (user AND IP)?
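For the questions about combining per-tier and per-endpoint limits, and limiting on several keys at once, one possible shape is sketched here. The rule table, its values, and the helper names `resolve_rule` and `allow_all` are all hypothetical; the counts live in a plain dict rather than a shared store.

```python
from typing import Dict, Optional, Tuple

# (client_tier, endpoint) -> (limit, window_seconds).
# An endpoint of None is the tier-wide default. Values are made up.
RULES: Dict[Tuple[str, Optional[str]], Tuple[int, int]] = {
    ("free", None): (60, 60),
    ("free", "/search"): (10, 60),
    ("pro", None): (600, 60),
}


def resolve_rule(tier: str, endpoint: str) -> Tuple[int, int]:
    """Prefer an endpoint-specific rule, else fall back to the tier default."""
    return RULES.get((tier, endpoint), RULES[(tier, None)])


def allow_all(counts: Dict[str, int], limits: Dict[str, int]) -> bool:
    """Allow only if every key (for example user AND IP) is under its limit."""
    # Check every key first so a rejected request consumes no quota anywhere.
    if any(counts.get(key, 0) >= limit for key, limit in limits.items()):
        return False
    for key in limits:
        counts[key] = counts.get(key, 0) + 1
    return True
```

Against a shared store, the check-then-increment in `allow_all` would need to run atomically, for instance inside a single script, to avoid the read-then-write race.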