Step 1 — Clarify the requirements
Never start drawing boxes. A strong candidate spends the first few minutes scoping the problem so the design that follows is justified. For a distributed key-value store, the questions worth asking are:
- What scale of keys and values, and what throughput and latency targets?
- Do we prioritise availability or strong consistency?
- Are values opaque blobs, or do we need range queries?
- What is the durability requirement (how many copies must survive)?
Functional requirements
- put(key, value) and get(key) over a large dataset.
- Scale horizontally by adding nodes.
- Survive node and network failures without data loss.
Non-functional requirements
- Low-latency reads and writes at high throughput.
- Tunable consistency vs availability.
- Even data and load distribution across nodes.
Step 2 — Back-of-the-envelope estimates
Sizing the system tells you which parts are hard. Round aggressively and state your assumptions out loud; the numbers matter less than showing you can reason about scale.
| Metric | Estimate | Reasoning |
|---|---|---|
| Data scale | petabytes, billions of keys | No single node holds it; partitioning is mandatory from the start. |
| Replication factor | N = 3 typical | Three copies tolerate node loss while bounding storage cost. |
Step 3 — Data model and API
A compact data model and a small API surface anchor the rest of the discussion. Keep both minimal; you can always extend them when the interviewer pushes.
Core entities
key-value pair
key, value, version (vector clock / timestamp)
Version metadata enables conflict detection.
ring membership
node_id, token range, virtual nodes
Consistent-hashing ring assigns key ranges to nodes.
API sketch
- GET
/get?key=— Read a value (coordinated across replicas). - PUT
/put— Write a value with chosen consistency level.
Step 4 — High-level design
Sketch the happy path end to end before optimising anything. This is the architecture you would draw on the whiteboard first:
- 1Hash each key onto a consistent-hashing ring; the key's coordinator node and the next N-1 nodes hold replicas.
- 2A write is sent to the replicas; it succeeds when W of N acknowledge.
- 3A read queries replicas and succeeds when R of N respond; the freshest version wins.
- 4Failures are handled with hinted handoff and anti-entropy repair so replicas reconverge.
Step 5 — Deep dives that separate strong answers
The high-level design is table stakes. Interviewers spend most of the time here, probing the decisions that actually carry the system. These are the ones to be ready for.
Partitioning with consistent hashing
To spread billions of keys across nodes and add/remove nodes cheaply, map both keys and nodes onto a hash ring; a key belongs to the first node clockwise from its hash. Adding or removing a node only relocates the keys between it and its neighbour, not the whole dataset, which is the property that makes elastic scaling practical. Plain consistent hashing distributes unevenly with few nodes, so each physical node owns many virtual nodes scattered around the ring, smoothing both data and load. This is the partitioning scheme Dynamo popularised.
Replication and quorum consistency
Each key is replicated to N nodes (the next N-1 after its coordinator). Consistency is tunable with quorum parameters: a write waits for W acknowledgements and a read for R responses. If W + R > N, a read is guaranteed to see the latest acknowledged write (strong-ish consistency); choosing smaller W and R favours latency and availability over freshness. This single knob expresses the CAP/PACELC tradeoff concretely: W=N gives durable writes but stalls if a replica is down; R=1, W=1 is fast and always available but may read stale data.
Conflict resolution and failure handling
With concurrent writes to different replicas during a partition, versions can diverge. Detect this with vector clocks (or last-write-wins timestamps if you accept lost updates) and resolve either automatically (LWW) or by returning siblings for the client to merge, as Dynamo does for shopping carts. Temporary node failure uses hinted handoff: a healthy node accepts the write on behalf of the down one and forwards it later. Permanent divergence is fixed by anti-entropy using Merkle trees to find and repair differing ranges efficiently. Gossip keeps membership and failure detection decentralised.
Step 6 — Bottlenecks and how to scale past them
Naming where the design breaks, and the specific fix, is what signals seniority. For a distributed key-value store the pressure points are:
Uneven key distribution / hot keys.
Virtual nodes for spread; replicate or cache hot keys.
Quorum latency when a replica is slow.
Tune R/W, use hinted handoff, and read from the fastest replicas.
Replica divergence after partitions.
Vector clocks for detection plus Merkle-tree anti-entropy repair.
Step 7 — Key tradeoffs
There is rarely one right answer. State the tradeoff, then commit to a side with a reason tied to the requirements you clarified in step one.
Consistency
Strong (W+R > N, slower)
Eventual (small R/W, faster, may be stale)
Guidance: Tune per workload; carts and counters tolerate eventual, balances do not.
Conflict resolution
Last-write-wins (simple, can lose data)
Vector clocks + client merge (correct, complex)
Guidance: LWW where loss is acceptable; vector clocks when every write must survive.
Common follow-up questions
When you finish the core design, expect the interviewer to pull on one of these threads. Have a one-paragraph answer ready for each.
- How do vector clocks detect concurrent writes, concretely?
- To spread billions of keys across nodes and add/remove nodes cheaply, map both keys and nodes onto a hash ring; a key belongs to the first node clockwise from its hash. Adding or removing a node only relocates the keys between it and its neighbour, not the whole dataset, which is the property that makes elastic scaling practical. Sketch the change against the high-level design above and tie your choice back to the requirements you clarified, rather than reaching for the most complex option.
- How would you add range queries on top of a hash-partitioned store?
- Each key is replicated to N nodes (the next N-1 after its coordinator). Consistency is tunable with quorum parameters: a write waits for W acknowledgements and a read for R responses. Sketch the change against the high-level design above and tie your choice back to the requirements you clarified, rather than reaching for the most complex option.
- How do you rebalance when adding a node without downtime?
- With concurrent writes to different replicas during a partition, versions can diverge. Detect this with vector clocks (or last-write-wins timestamps if you accept lost updates) and resolve either automatically (LWW) or by returning siblings for the client to merge, as Dynamo does for shopping carts. Sketch the change against the high-level design above and tie your choice back to the requirements you clarified, rather than reaching for the most complex option.
- What changes if you need strong consistency for every operation?
- To spread billions of keys across nodes and add/remove nodes cheaply, map both keys and nodes onto a hash ring; a key belongs to the first node clockwise from its hash. Adding or removing a node only relocates the keys between it and its neighbour, not the whole dataset, which is the property that makes elastic scaling practical. Sketch the change against the high-level design above and tie your choice back to the requirements you clarified, rather than reaching for the most complex option.