FindIt is building a niche search engine for technical documentation (think a focused Google for developer docs). The system needs:
- Web crawler - a distributed crawler that discovers and downloads web pages. It must respect robots.txt, handle per-domain rate limiting, avoid duplicate URLs, and re-crawl pages periodically to keep the index fresh.
- Indexing pipeline - parse downloaded HTML, extract text content, and build an inverted index mapping keywords → document IDs with term frequency and position data.
- Search API - accept a keyword query and return the top-10 most relevant documents ranked by TF-IDF, with snippets showing query terms in context.
- URL frontier - a priority queue that determines which URLs to crawl next, balancing freshness, importance, and politeness (don't hammer any single domain).
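The URL frontier above can be sketched as a min-heap with a per-domain cooldown. This is a simplified single-process illustration, not the distributed design itself; the class name, the fixed `per_domain_delay`, and the crude host extraction are all assumptions for the sketch (production code would parse hosts with `urllib.parse` and read delays from robots.txt):

```python
import heapq
import time

class URLFrontier:
    """Min-heap frontier: lower priority value = crawl sooner.
    A per-domain cooldown enforces politeness."""

    def __init__(self, per_domain_delay=1.0):
        self.heap = []            # (priority, url)
        self.seen = set()         # dedupes URLs
        self.next_allowed = {}    # domain -> earliest next fetch time
        self.delay = per_domain_delay

    def add(self, url, priority):
        if url not in self.seen:
            self.seen.add(url)
            heapq.heappush(self.heap, (priority, url))

    def pop(self, now=None):
        """Return the best URL whose domain is polite to hit, else None."""
        now = time.time() if now is None else now
        deferred, result = [], None
        while self.heap:
            priority, url = heapq.heappop(self.heap)
            domain = url.split("/")[2]  # crude host extraction for the sketch
            if self.next_allowed.get(domain, 0.0) <= now:
                self.next_allowed[domain] = now + self.delay
                result = url
                break
            deferred.append((priority, url))
        for item in deferred:       # put skipped URLs back for later
            heapq.heappush(self.heap, item)
        return result
```

Skipped-but-cooling URLs are pushed back onto the heap, so importance ordering is preserved while politeness is enforced.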
The target corpus is 10 million pages, re-crawled on a 7-day cycle.
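Those two numbers pin down the crawler's sustained throughput; a quick back-of-the-envelope check (the 100 KB average page size is an assumed figure, not from the brief):

```python
PAGES = 10_000_000
CYCLE_SECONDS = 7 * 24 * 3600        # 7-day re-crawl cycle

sustained_rate = PAGES / CYCLE_SECONDS
print(f"{sustained_rate:.1f} pages/sec sustained")   # ~16.5 pages/sec

AVG_PAGE_BYTES = 100_000             # assumed average page size
bandwidth = sustained_rate * AVG_PAGE_BYTES
print(f"~{bandwidth / 1e6:.1f} MB/s download bandwidth")
```

So the steady-state load is modest; the hard parts are bursts, politeness limits per domain, and failure handling, not raw throughput.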
Design a web crawler that indexes 10 million pages and serves keyword search results in under 200 ms. Build this architecture under realistic production constraints, then validate the tradeoffs in the design lab simulation.
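A minimal sketch of the inverted index and TF-IDF ranking the brief asks for, assuming whitespace tokenization and a plain `log(N / df)` IDF (real systems would use a proper tokenizer, smoothing, and per-term position lists for snippets):

```python
import math
from collections import Counter, defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns (inverted_index, doc_count)."""
    index = defaultdict(dict)               # term -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index, len(docs)

def search(index, n_docs, query, k=10):
    """Score docs by summed TF-IDF over query terms; return top-k doc IDs."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

At 10 million pages the postings live on disk in compressed, sorted runs rather than in a Python dict, but the scoring logic is the same shape.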
Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.
Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary SQL DB -> Message Queue -> Background Workers -> Object Storage
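The decoupling in the reference flow can be shown with in-process stand-ins: a dict for the primary SQL DB, a `queue.Queue` for the message queue, and a second dict for object storage. All names here are illustrative; the point is that the API path only records and enqueues, while workers do the slow fetch asynchronously:

```python
import queue
import threading

db = {}                        # stand-in for the primary SQL DB
work_queue = queue.Queue()     # stand-in for the message queue
object_store = {}              # stand-in for object storage

def api_service(url):
    """Synchronous path: record the crawl request, hand off, return fast."""
    db[url] = "queued"
    work_queue.put(url)
    return {"status": "accepted", "url": url}

def background_worker():
    """Async path: drain the queue, 'crawl', persist the result."""
    while True:
        url = work_queue.get()
        if url is None:                            # sentinel to stop
            break
        object_store[url] = f"<html for {url}>"    # stand-in for a real fetch
        db[url] = "done"
        work_queue.task_done()

# Usage: one worker thread, one request, then a clean shutdown.
worker = threading.Thread(target=background_worker)
worker.start()
api_service("https://example.com/docs")
work_queue.join()              # wait until the enqueued work is processed
work_queue.put(None)
worker.join()
```

Because the API service never blocks on the download, ingress latency stays decoupled from crawl latency, which is what lets each layer scale independently.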