Medium · Search Engine · Part 1

Search Engine 1 - Web Crawler & Index

Databases · Message Queues · Storage · API Design

Problem Statement

FindIt is building a niche search engine for technical documentation (think a focused Google for developer docs). The system needs:

- Web crawler - a distributed crawler that discovers and downloads web pages. It must respect robots.txt, handle per-domain rate limiting, avoid duplicate URLs, and re-crawl pages periodically to keep the index fresh.
- Indexing pipeline - parse downloaded HTML, extract text content, and build an inverted index mapping keywords → document IDs with term-frequency and position data.
- Search API - accept a keyword query and return the top-10 most relevant documents ranked by TF-IDF, with snippets showing query terms in context.
- URL frontier - a priority queue that determines which URLs to crawl next, balancing freshness, importance, and politeness (don't hammer any single domain).

The target corpus is 10 million pages, re-crawled on a 7-day cycle.
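A quick sanity check shows the stated crawl rate comfortably covers the re-crawl cycle:

```python
# Back-of-envelope check: does ~100 pages/s cover a 7-day re-crawl cycle?
PAGES = 10_000_000
CYCLE_SECONDS = 7 * 24 * 3600          # 604,800 s in one week

required_rate = PAGES / CYCLE_SECONDS  # average pages/s needed
headroom = 100 / required_rate         # ratio of target rate to minimum

print(f"required: {required_rate:.1f} pages/s, headroom: {headroom:.1f}x")
```

About 16.5 pages/second on average suffices, so a 100 pages/second target leaves roughly 6x headroom for retries, re-crawls of high-priority pages, and crawler downtime.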

What You'll Learn

Design a web crawler that indexes 10 M pages and serves keyword search results in < 200 ms. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

  • Indexed pages: ~10,000,000
  • Crawl rate: ~100 pages/second
  • Re-crawl cycle: 7 days
  • Index size (compressed): ~500 GB
  • Search latency: < 200 ms
  • Concurrent search queries: ~1,000/sec
  • Availability target: 99.9%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a web crawler that indexes 10 M pages and serves keyword search results in < 200 ms.
  • Size crawl capacity for a peak around 150 pages/second (burst headroom over the ~100 pages/second baseline), and size the search tier for the ~1,000 queries/second target.
  • Indexed pages: ~10,000,000
  • Crawl rate: ~100 pages/second
  • Re-crawl cycle: 7 days
  • Index size (compressed): ~500 GB
  • Search latency: < 200 ms

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
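One way to make the per-hop latency budgets concrete is to write them down and check they sum under the SLO. The split below is an illustrative assumption, not a measured allocation:

```python
# Hypothetical per-hop budget for the 200 ms search SLO.
# The individual numbers are assumptions for illustration only.
BUDGET_MS = 200
hops = {
    "load balancer + gateway":   10,
    "query parsing":             10,
    "index lookup (postings)":   80,
    "ranking (TF-IDF, top-10)":  50,
    "snippet generation":        30,
}
used = sum(hops.values())
assert used <= BUDGET_MS, "per-hop budgets exceed the SLO"
print(f"allocated {used} ms of {BUDGET_MS} ms, {BUDGET_MS - used} ms slack")
```

Keeping explicit slack (here 20 ms) lets you defend the p95 number in review: any hop that overruns its line item is immediately visible.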

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
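The idempotency-key point can be sketched in a few lines. This is a minimal in-memory illustration with hypothetical names; a real service would back the key store with a conditional write in the database:

```python
# Minimal sketch of an idempotent write endpoint (names are hypothetical).
_seen: dict[str, dict] = {}  # idempotency_key -> stored response

def submit_crawl_request(idempotency_key: str, url: str) -> dict:
    """Replay the stored response instead of enqueuing the URL twice."""
    if idempotency_key in _seen:
        return _seen[idempotency_key]
    response = {"accepted": True, "url": url}  # side effect would go here
    _seen[idempotency_key] = response
    return response

first = submit_crawl_request("req-1", "https://docs.example.com/a")
replay = submit_crawl_request("req-1", "https://docs.example.com/a")
assert first is replay  # duplicate submission returns the same result
```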

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and define an explicit backup/restore strategy.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Enforce lifecycle policies, retention tiers, and checksum validation.
  • Apply strict input validation and backward-compatible versioning.
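The idempotent-consumer guarantee above amounts to deduplicating on a message ID before applying the side effect. A minimal sketch, assuming each message carries a correlation ID:

```python
# Sketch of an idempotent queue consumer keyed by correlation ID
# (the message schema here is an assumption for illustration).
processed: set[str] = set()
index: dict[str, str] = {}  # doc_id -> extracted text

def handle_message(msg: dict) -> bool:
    """Apply the message exactly once; redeliveries become no-ops."""
    msg_id = msg["correlation_id"]
    if msg_id in processed:
        return False                      # duplicate delivery, skip
    index[msg["doc_id"]] = msg["text"]    # the actual side effect
    processed.add(msg_id)
    return True

msg = {"correlation_id": "c-42", "doc_id": "d1", "text": "hello docs"}
assert handle_message(msg) is True
assert handle_message(msg) is False       # redelivery is safely ignored
```

In production the `processed` set lives in durable storage and is updated in the same transaction as the side effect, so a crash between the two cannot cause double-application.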

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
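The p95 in the last bullet is simple to compute from raw latency samples; the nearest-rank method is a common convention:

```python
import math

def p95(samples_ms: list[float]) -> float:
    """p95 latency via the nearest-rank method."""
    ordered = sorted(samples_ms)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

latencies = list(range(1, 101))  # synthetic samples: 1..100 ms
print(p95(latencies))            # 95th value of 100 sorted samples
```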

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.

Practical Notes

  • The URL frontier is essentially a distributed priority queue - consider using a message broker with priority support.
  • Bloom filters can efficiently detect already-crawled URLs without storing the full URL set in memory.
  • An inverted index can be stored as sorted posting lists - postings for a term are the list of doc IDs containing it.
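The posting-list idea in the last bullet pairs naturally with the TF-IDF ranking from the problem statement. A toy in-memory sketch (the corpus and scoring are illustrative; real indexes compress postings and precompute norms):

```python
import math
from collections import defaultdict

# Toy inverted index: term -> sorted posting list of (doc_id, term_frequency).
docs = {
    1: "rust async runtime tokio",
    2: "python async io tutorial",
    3: "rust ownership and borrowing",
}
postings: dict[str, list[tuple[int, int]]] = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    counts: dict[str, int] = defaultdict(int)
    for term in text.split():
        counts[term] += 1
    for term, tf in counts.items():
        postings[term].append((doc_id, tf))  # doc IDs stay sorted

def tf_idf_search(query: str, top_k: int = 10) -> list[int]:
    """Score docs by summed TF-IDF over query terms; return top doc IDs."""
    scores: dict[int, float] = defaultdict(float)
    n = len(docs)
    for term in query.split():
        plist = postings.get(term, [])
        if not plist:
            continue
        idf = math.log(n / len(plist))  # rarer terms weigh more
        for doc_id, tf in plist:
            scores[doc_id] += tf * idf
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [doc_id for doc_id, _ in ranked][:top_k]

print(tf_idf_search("rust async"))
```

Because posting lists are sorted by doc ID, multi-term queries can intersect or merge them with a linear scan, which is what keeps lookups fast at 10 M-document scale.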


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary SQL DB -> Message Queue -> Background Workers -> Object Storage

Design strengths

  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.