Step 1 — Clarify the requirements
Never start drawing boxes. A strong candidate spends the first few minutes scoping the problem so the design that follows is justified. For a search autocomplete system, the questions worth asking are:
- How many suggestions do we return, and ranked by what (popularity, personalisation)?
- What latency budget per keystroke (typically under ~100 ms end-to-end)?
- How fresh must suggestions be: do trending queries need to appear quickly?
- Do we support typo tolerance and multi-language input?
Functional requirements
- Return the top-k completions for a prefix as the user types.
- Rank suggestions by popularity (and optionally personalisation).
- Update the suggestion set from real query traffic.
Non-functional requirements
- Very low latency per keystroke; the UX feels instant or not at all.
- Scale to high query volume across many users.
- Reasonable freshness so trending terms surface.
Step 2 — Back-of-the-envelope estimates
Sizing the system tells you which parts are hard. Round aggressively and state your assumptions out loud; the numbers matter less than showing you can reason about scale.
| Metric | Estimate | Reasoning |
|---|---|---|
| Latency budget | < 100 ms per keystroke | Each character typed is a request; slow suggestions are worse than none. |
| Read:write ratio | massively read-heavy | Serving suggestions vastly outnumbers updating the index from logs. |
Step 3 — Data model and API
A compact data model and a small API surface anchor the rest of the discussion. Keep both minimal; you can always extend them when the interviewer pushes.
Core entities
trie nodes
prefix -> top-k suggestions (precomputed) with scores
Cache top-k at each node so a query is a single lookup, not a subtree walk.
query_logs
query, count, last_seen
Aggregated source of popularity, fed by the build pipeline.
API sketch
- GET
/api/v1/suggest?q={prefix}— Return ranked completions for the prefix.
Step 4 — High-level design
Sketch the happy path end to end before optimising anything. This is the architecture you would draw on the whiteboard first:
- 1Build path: aggregate query logs offline to compute each term's popularity and assemble a trie.
- 2Precompute and store the top-k completions at every prefix node so serving is one lookup.
- 3Serve path: a request resolves the prefix and returns the cached top-k from an in-memory store.
- 4Periodically rebuild/merge the trie to fold in new and trending queries.
Step 5 — Deep dives that separate strong answers
The high-level design is table stakes. Interviewers spend most of the time here, probing the decisions that actually carry the system. These are the ones to be ready for.
Trie with precomputed top-k
A trie stores strings as a tree of characters, so finding all completions of a prefix means walking to the prefix node and collecting its subtree. Doing that subtree walk on every keystroke is too slow at scale, so the key optimisation is to precompute and cache the top-k suggestions directly at each node. Then a query is: navigate to the prefix node (O(prefix length)) and return its stored list, with no traversal at serve time. This trades extra storage and build cost for the strict latency the feature demands.
Build pipeline and ranking
Popularity comes from real usage, so a batch (or near-real-time streaming) pipeline aggregates query logs into counts, optionally time-decayed so trending terms rise and stale ones fade. The pipeline rebuilds the trie (or updates it incrementally) with refreshed top-k lists. Because rebuilding on every query is impossible, accept that suggestions lag traffic by minutes to hours for the bulk index, and layer a faster path for trending terms if real-time freshness is required. Ranking can blend raw frequency with recency and, for logged-in users, personalisation.
Serving at scale, sharding, and caching
The serving tier keeps the trie (or its top-k tables) in memory for sub-millisecond lookups, fronted by an aggressive cache because prefixes are heavily skewed (a tiny set of prefixes covers most traffic). Shard the trie by first character or prefix range across nodes so it fits in memory and load spreads, routing each request to the owning shard. A CDN or edge cache can even serve the most common prefixes. The result is a system where the expensive work is all offline and the online path is a cached memory read.
Step 6 — Bottlenecks and how to scale past them
Naming where the design breaks, and the specific fix, is what signals seniority. For a search autocomplete system the pressure points are:
Subtree traversal per keystroke is too slow.
Precompute top-k at each trie node; serve is then a single lookup.
Trie too large for one node's memory.
Shard by prefix; keep shards in memory and route per request.
Trending terms lag the batch rebuild.
Add a streaming fast-path layer for recent/trending queries.
Step 7 — Key tradeoffs
There is rarely one right answer. State the tradeoff, then commit to a side with a reason tied to the requirements you clarified in step one.
Freshness vs build cost
Batch rebuild (cheap, lags traffic)
Streaming updates (fresh, complex)
Guidance: Batch for the bulk index; stream a small trending layer if real-time matters.
Storage vs latency
Walk subtree at query time (compact, slow)
Precompute top-k per node (fast, larger)
Guidance: Precompute: the latency requirement dominates the storage cost.
Common follow-up questions
When you finish the core design, expect the interviewer to pull on one of these threads. Have a one-paragraph answer ready for each.
- How would you add typo tolerance (fuzzy matching)?
- A trie stores strings as a tree of characters, so finding all completions of a prefix means walking to the prefix node and collecting its subtree. Doing that subtree walk on every keystroke is too slow at scale, so the key optimisation is to precompute and cache the top-k suggestions directly at each node. Sketch the change against the high-level design above and tie your choice back to the requirements you clarified, rather than reaching for the most complex option.
- How do you personalise suggestions per user?
- Popularity comes from real usage, so a batch (or near-real-time streaming) pipeline aggregates query logs into counts, optionally time-decayed so trending terms rise and stale ones fade. The pipeline rebuilds the trie (or updates it incrementally) with refreshed top-k lists. Sketch the change against the high-level design above and tie your choice back to the requirements you clarified, rather than reaching for the most complex option.
- How do you handle multi-word and multi-language prefixes?
- The serving tier keeps the trie (or its top-k tables) in memory for sub-millisecond lookups, fronted by an aggressive cache because prefixes are heavily skewed (a tiny set of prefixes covers most traffic). Shard the trie by first character or prefix range across nodes so it fits in memory and load spreads, routing each request to the owning shard. Sketch the change against the high-level design above and tie your choice back to the requirements you clarified, rather than reaching for the most complex option.
- How quickly can a brand-new trending query appear, and how?
- A trie stores strings as a tree of characters, so finding all completions of a prefix means walking to the prefix node and collecting its subtree. Doing that subtree walk on every keystroke is too slow at scale, so the key optimisation is to precompute and cache the top-k suggestions directly at each node. Sketch the change against the high-level design above and tie your choice back to the requirements you clarified, rather than reaching for the most complex option.