Step 1 — Clarify the requirements
Never start by drawing boxes. A strong candidate spends the first few minutes scoping the problem so the design that follows is justified. For a web crawler, the questions worth asking are:
- What is the goal: search indexing, archiving, or monitoring, and what scale of pages?
- How fresh must the crawl be, and do we recrawl pages?
- Do we only fetch HTML, or also images, PDFs, and rendered JavaScript?
- How politely must we crawl (per-domain rate limits, robots.txt)?
Functional requirements
- Fetch pages starting from seeds and discover new URLs from their links.
- Store page content for downstream indexing.
- Respect robots.txt and per-domain politeness limits.
Non-functional requirements
- Scalable to billions of pages across many fetcher nodes.
- Polite: never overload a single site.
- Robust against duplicates, traps, and malformed content.
Step 2 — Back-of-the-envelope estimates
Sizing the system tells you which parts are hard. Round aggressively and state your assumptions out loud; the numbers matter less than showing you can reason about scale.
| Metric | Estimate | Reasoning |
|---|---|---|
| Pages to crawl | ~1 B / month target | A month-long crawl of a billion pages is ~400 pages/sec sustained. |
| Average page size | ~500 KB | Storage and bandwidth scale with this; a billion pages is ~500 TB raw. |
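The arithmetic behind the table can be checked in a few lines. This is a minimal sketch that assumes a 30-day month and decimal units (1 KB = 1000 bytes); the bandwidth figure is derived from the same two inputs and is not stated in the table.

```python
# Back-of-the-envelope numbers from the table above.
PAGES_PER_MONTH = 1_000_000_000      # ~1 B pages per month
AVG_PAGE_BYTES = 500 * 1000          # ~500 KB per page (decimal units assumed)
SECONDS_PER_MONTH = 30 * 24 * 3600   # assumes a 30-day month

pages_per_sec = PAGES_PER_MONTH / SECONDS_PER_MONTH
raw_storage_tb = PAGES_PER_MONTH * AVG_PAGE_BYTES / 1e12
bandwidth_mb_per_sec = pages_per_sec * AVG_PAGE_BYTES / 1e6

print(f"{pages_per_sec:.0f} pages/sec sustained")
print(f"{raw_storage_tb:.0f} TB raw storage")
print(f"{bandwidth_mb_per_sec:.0f} MB/s sustained download")
```

The result is roughly 386 pages/sec, which rounds to the ~400 in the table, 500 TB of raw storage, and a little under 200 MB/s of sustained download bandwidth.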
Step 3 — Data model and API
A compact data model and a small API surface anchor the rest of the discussion. Keep both minimal; you can always extend them when the interviewer pushes.
Core entities
- url_frontier (url, priority, domain, next_fetch_at): Prioritised, politeness-aware queue of URLs to fetch.
- seen_urls (url_hash as PK): Dedup set; a Bloom filter fronts it to avoid disk hits.
- content_store (url_hash, content_hash, body, fetched_at): Stores page bodies; content_hash catches exact duplicates, and a similarity hash such as SimHash would be needed for near-duplicates.
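The entities above could be written down as plain records. The sketch below is one possible shape, not a prescribed schema: the class names `FrontierEntry` and `StoredPage`, the helper `make_page`, the choice of SHA-256, and the convention that a lower `priority` number means "fetch sooner" are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


def url_hash(url: str) -> str:
    """Stable key for seen_urls and content_store (SHA-256 is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


@dataclass(order=True)
class FrontierEntry:
    # Ordering uses (next_fetch_at, priority); url and domain are ignored.
    next_fetch_at: float
    priority: int
    url: str = field(compare=False)
    domain: str = field(compare=False)


@dataclass
class StoredPage:
    url_hash: str
    content_hash: str   # exact-duplicate detection only
    body: bytes
    fetched_at: float


def make_page(url: str, body: bytes, fetched_at: float) -> StoredPage:
    """Hypothetical helper that fills in both hashes for a fetched page."""
    content_hash = hashlib.sha256(body).hexdigest()
    return StoredPage(url_hash(url), content_hash, body, fetched_at)
```

Two mirror URLs serving identical bytes get different `url_hash` values but the same `content_hash`, which is what lets the content store collapse them.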
API sketch
- POST /seeds: Add seed URLs to start or extend a crawl.
- GET /status: Crawl progress, frontier depth, error rates.
Step 4 — High-level design
Sketch the happy path end to end before optimising anything. This is the architecture you would draw on the whiteboard first:
1. A URL frontier hands out URLs to fetcher workers, ordered by priority and politeness.
2. Fetchers download pages (honouring robots.txt), store content, and pass HTML to a parser.
3. The parser extracts links, normalises them, filters seen URLs, and feeds new ones back into the frontier.
4. A dedup layer (Bloom filter + store) prevents re-crawling and detects duplicate content.
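The four steps above fit in a single-process loop, which can help when explaining the data flow before distributing anything. The sketch below is deliberately simplified: the frontier is a plain FIFO queue with no priority or politeness, the seen set is an in-memory `set`, and `fetch` is a caller-supplied function (an assumption, so the example runs without network access). The names `LinkParser` and `crawl` are illustrative.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """fetch(url) returns HTML text, or None on failure."""
    frontier = deque(seeds)   # step 1: frontier (FIFO here, no politeness)
    seen = set(seeds)         # step 4: URL dedup
    store = {}                # content store: url -> html
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)     # step 2: fetch and store
        if html is None:
            continue
        store[url] = html
        parser = LinkParser() # step 3: parse, normalise, filter, enqueue
        parser.feed(html)
        for href in parser.links:
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return store


# Usage with a tiny fake web instead of real HTTP.
FAKE_WEB = {
    "http://a.test/": '<a href="/b">b</a><a href="http://a.test/#top">top</a>',
    "http://a.test/b": '<a href="/">home</a><a href="/missing">gone</a>',
}
pages = crawl(["http://a.test/"], FAKE_WEB.get)
```

Here `pages` ends up holding the two reachable pages; the `#top` link collapses onto the seed and `/missing` is skipped because the fake fetcher returns `None` for it.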
Step 5 — Deep dives that separate strong answers
The high-level design is table stakes. Interviewers spend most of the time here, probing the decisions that actually carry the system. These are the ones to be ready for.
The URL frontier and politeness
The frontier is the heart of a crawler: a queue that decides what to fetch next, balancing priority (important or fresh pages first) against politeness (do not hammer one domain). A common design uses two layers: front queues for prioritisation and back queues mapped per domain, each with a delay so a single host is fetched no faster than its robots.txt crawl-delay or a default. This prevents the crawler from accidentally DoSing a site, which is both rude and a fast way to get blocked. Respect robots.txt and cache it per domain.
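A reduced version of the two-layer idea can be sketched with one priority heap per domain (the back queues) and a heap of domains keyed by the time each may next be fetched. This is an illustrative simplification, not the canonical design: priority only orders URLs within a domain, a lower number means "sooner", and the class name `PoliteFrontier`, its methods, and the explicit `now` argument are assumptions chosen to keep the example deterministic.

```python
import heapq
from collections import defaultdict
from urllib.parse import urlsplit


class PoliteFrontier:
    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.delays = {}                      # domain -> crawl-delay override
        self.back_queues = defaultdict(list)  # domain -> heap of (priority, seq, url)
        self.ready = []                       # heap of (next_fetch_at, domain)
        self.scheduled = set()                # domains currently in self.ready
        self.next_allowed = {}                # domain -> earliest next fetch time
        self._seq = 0                         # tie-breaker that keeps FIFO order

    def set_delay(self, domain, seconds):
        """Record a per-domain delay, e.g. from robots.txt crawl-delay."""
        self.delays[domain] = seconds

    def add(self, url, priority=0, now=0.0):
        domain = urlsplit(url).hostname or ""
        self._seq += 1
        heapq.heappush(self.back_queues[domain], (priority, self._seq, url))
        if domain not in self.scheduled:
            when = max(now, self.next_allowed.get(domain, 0.0))
            heapq.heappush(self.ready, (when, domain))
            self.scheduled.add(domain)

    def next_url(self, now):
        """Return a URL whose domain may be fetched at `now`, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, domain = heapq.heappop(self.ready)
        _, _, url = heapq.heappop(self.back_queues[domain])
        delay = self.delays.get(domain, self.default_delay)
        self.next_allowed[domain] = now + delay
        if self.back_queues[domain]:
            heapq.heappush(self.ready, (self.next_allowed[domain], domain))
        else:
            self.scheduled.discard(domain)
            del self.back_queues[domain]
        return url
```

Because each domain appears at most once in the `ready` heap, a host with thousands of queued URLs cannot starve other hosts, and it is never handed out again before its delay has elapsed.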
Deduplication at scale
There are two dedup problems: URL dedup (do not enqueue the same URL twice) and content dedup (do not store the same page reachable via different URLs). For URLs, normalise (lowercase host, strip fragments, sort query params) then check a set; a Bloom filter in front gives a fast probabilistic 'definitely new / maybe seen' answer and keeps the exact set off the hot path for URLs it has never seen. For content, hash the body to catch exact copies, or use a similarity hash like SimHash for near-duplicates, so mirror pages and boilerplate-only differences collapse. Without this, a crawler can waste much of its capacity re-fetching the same content.
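The URL side of this can be sketched as a normaliser plus a small Bloom filter in front of an exact set. Everything here is illustrative: `normalise`, `BloomFilter`, and `is_new_url` are hypothetical names, the filter derives its bit positions from one SHA-256 digest via double hashing (one common construction among several), and a Python `set` stands in for the on-disk exact store.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalise(url: str) -> str:
    """Lowercase scheme and host, sort query params, strip the fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Lowercasing the whole netloc also lowercases any userinfo; fine for a sketch.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", query, ""))


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def is_new_url(url: str, bloom: BloomFilter, exact_seen: set) -> bool:
    """True if the URL has not been seen; records it as seen either way."""
    key = normalise(url)
    # "Definitely new" skips the exact lookup; "maybe seen" falls through to it.
    if bloom.might_contain(key) and key in exact_seen:
        return False
    bloom.add(key)
    exact_seen.add(key)
    return True
```

With this shape, `HTTP://Example.COM/a?b=2&a=1#frag` and `http://example.com/a?a=1&b=2` normalise to the same key, so only the first one is enqueued.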
Distributed coordination and traps
At billions of pages, fetching is distributed across many workers, partitioned by domain hash so politeness state for a domain lives in one place. Workers are stateless and pull from the frontier; the frontier and dedup stores are the shared state. Guard against crawler traps: infinite calendars, session-id query loops, and deliberately deep generated link mazes. Defences include depth limits, per-domain page caps, URL-pattern filters, and detecting low-information pages. Handle failures with retries and a dead-letter list for persistently broken URLs.
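Two pieces of this paragraph lend themselves to a short sketch: routing a URL to a worker by domain hash, and a guard that applies the listed trap defences. The thresholds, the session-parameter names, the repeated-path-segment heuristic, and the names `worker_for` and `TrapGuard` are all assumptions for illustration; real crawlers tune these per deployment.

```python
import hashlib
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

MAX_URL_LENGTH = 2000                                   # assumed threshold
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}  # assumed names


def worker_for(url: str, num_workers: int) -> int:
    """Route every URL of a host to the same worker.

    Uses SHA-256 rather than built-in hash(), which is salted per process
    and would route differently on each node.
    """
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


class TrapGuard:
    def __init__(self, max_depth=20, max_pages_per_domain=100_000):
        self.max_depth = max_depth
        self.max_pages_per_domain = max_pages_per_domain
        self.pages_per_domain = Counter()

    def allow(self, url: str, depth: int) -> bool:
        parts = urlsplit(url)
        host = parts.hostname or ""
        if depth > self.max_depth or len(url) > MAX_URL_LENGTH:
            return False                                  # depth / length limit
        if self.pages_per_domain[host] >= self.max_pages_per_domain:
            return False                                  # per-domain page cap
        if any(k.lower() in SESSION_PARAMS for k, _ in parse_qsl(parts.query)):
            return False                                  # session-id query loops
        segments = [s for s in parts.path.split("/") if s]
        if segments and max(Counter(segments).values()) > 3:
            return False                                  # /a/a/a/a/... style mazes
        self.pages_per_domain[host] += 1
        return True
```

Dropping session-id URLs outright is the bluntest option; stripping the parameter during normalisation is a gentler alternative if those pages matter.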
Step 6 — Bottlenecks and how to scale past them
Naming where the design breaks, and the specific fix, is what signals seniority. For a web crawler the pressure points are:
- DNS resolution and network I/O dominate latency. Fix: cache DNS, use high-concurrency async fetchers, and co-locate fetchers near targets.
- Dedup set too large for memory. Fix: Bloom filter front end plus a sharded on-disk set; accept a tiny false-positive rate.
- Crawler traps waste capacity. Fix: depth/page caps per domain, URL-pattern filters, near-duplicate detection.
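Of the fixes listed above, DNS caching is the simplest to show in isolation. The sketch below is an in-process cache with a fixed time-to-live; the class name `DnsCache`, the 300-second default, and the injectable `resolve` and `clock` parameters are assumptions. It ignores the TTL carried in real DNS records, which a production resolver would honour.

```python
import socket
import time


class DnsCache:
    def __init__(self, ttl_seconds=300.0, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl = ttl_seconds
        self.resolve = resolve      # injectable so tests need no network
        self.clock = clock
        self._entries = {}          # host -> (ip, expires_at)

    def lookup(self, host: str) -> str:
        now = self.clock()
        cached = self._entries.get(host)
        if cached is not None and cached[1] > now:
            return cached[0]        # cache hit: no resolver round trip
        ip = self.resolve(host)     # cache miss or expired entry
        self._entries[host] = (ip, now + self.ttl)
        return ip
```

Because fetchers for one domain are routed to one worker, the per-worker hit rate for a cache like this tends to be high.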
Step 7 — Key tradeoffs
There is rarely one right answer. State the tradeoff, then commit to a side with a reason tied to the requirements you clarified in step one.
Freshness vs coverage
Recrawl known pages often (fresh)
Discover new pages (broad)
Guidance: Prioritise the frontier by change frequency and importance; budget capacity between the two.
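One way to make this guidance concrete is a score that multiplies importance by the chance a page has changed since its last fetch, plus an explicit budget split. The sketch assumes page changes follow a Poisson process with a known rate, which is a modelling choice introduced here and not something the design above prescribes; `recrawl_score` and `split_budget` are hypothetical names.

```python
import math


def recrawl_score(importance: float, changes_per_day: float,
                  days_since_fetch: float) -> float:
    """Importance weighted by P(page changed since last fetch).

    Assumes changes arrive as a Poisson process, so
    P(changed) = 1 - exp(-rate * age).
    """
    p_changed = 1.0 - math.exp(-changes_per_day * days_since_fetch)
    return importance * p_changed


def split_budget(total_fetches: int, recrawl_fraction: float):
    """Divide fetch capacity between recrawling and discovery."""
    recrawl = round(total_fetches * recrawl_fraction)
    return recrawl, total_fetches - recrawl
```

A news homepage that changes hourly climbs back to the top of the recrawl queue within a day, while a static archive page with the same importance stays near the bottom for weeks.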
Dedup exactness
Exact set (correct, memory-heavy)
Bloom filter (compact, false positives)
Guidance: Bloom filter front end backed by an exact store, which resolves the filter's false positives; if you drop the exact check to save I/O, a rare skipped page is acceptable.
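To judge how compact the filter is, the standard sizing formulas give the bit count m = -n ln(p) / (ln 2)^2 and the hash count k = (m / n) ln 2 for n items at false-positive rate p. The sketch below applies them to a billion URLs at a 1% rate; both inputs are example values chosen here, and `bloom_parameters` is a hypothetical helper name.

```python
import math


def bloom_parameters(n_items: int, false_positive_rate: float):
    """Return (bits, hash_count) for a Bloom filter sized by the standard formulas."""
    bits = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
    hashes = max(1, round(bits / n_items * math.log(2)))
    return bits, hashes


bits, hashes = bloom_parameters(1_000_000_000, 0.01)
print(f"{bits / 8 / 1e9:.2f} GB, {hashes} hash functions")
```

That works out to about 1.2 GB and 7 hash functions, which fits in memory on a single node, whereas an exact set of a billion 32-byte hashes needs at least 32 GB before any index overhead.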
Common follow-up questions
When you finish the core design, expect the interviewer to pull on one of these threads. Have a one-paragraph answer ready for each.
- How would you crawl JavaScript-rendered pages?
- How do you keep a continuously refreshed index rather than a one-shot crawl?
- How do you prioritise which pages to (re)crawl?
- How do you detect and avoid spider traps automatically?