Medium · Search Engine · Part 1

Search Engine 1 - Web Crawler & Index

Databases · Message Queues · Storage · API Design

Problem Statement

FindIt is building a niche search engine for technical documentation (think a focused Google for developer docs). The system needs:

- Web crawler - a distributed crawler that discovers and downloads web pages. It must respect robots.txt, handle per-domain rate limiting, avoid duplicate URLs, and re-crawl pages periodically to keep the index fresh.
- Indexing pipeline - parse downloaded HTML, extract text content, and build an inverted index mapping keywords → document IDs with term-frequency and position data.
- Search API - accept a keyword query and return the top-10 most relevant documents ranked by TF-IDF, with snippets showing query terms in context.
- URL frontier - a priority queue that determines which URLs to crawl next, balancing freshness, importance, and politeness (don't hammer any single domain).

The target corpus is 10 million pages, re-crawled on a 7-day cycle.
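A quick sanity check shows the stated crawl rate comfortably covers the re-crawl cycle:

```python
# Back-of-envelope check: does ~100 pages/s cover a 7-day re-crawl cycle?
PAGES = 10_000_000
CYCLE_SECONDS = 7 * 24 * 3600          # 604,800 s in one week

required_rate = PAGES / CYCLE_SECONDS  # average pages/s needed
headroom = 100 / required_rate         # ratio of target rate to minimum

print(f"required: {required_rate:.1f} pages/s, headroom: {headroom:.1f}x")
```

About 16.5 pages/second on average suffices, so a 100 pages/second target leaves roughly 6x headroom for retries, re-crawls of high-priority pages, and crawler downtime.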

What You'll Learn

Design a web crawler that indexes 10 M pages and serves keyword search results in < 200 ms. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

  • Indexed pages: ~10,000,000
  • Crawl rate: ~100 pages/second
  • Re-crawl cycle: 7 days
  • Index size (compressed): ~500 GB
  • Search latency: < 200 ms
  • Concurrent search queries: ~1,000/sec
  • Availability target: 99.9%

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a web crawler that indexes 10 M pages and serves keyword search results in < 200 ms.
  • Size crawl capacity for a peak around 150 pages/second (burst headroom over the ~100 pages/second baseline), and size the search tier for the ~1,000 queries/second target.
  • Indexed pages: ~10,000,000
  • Crawl rate: ~100 pages/second
  • Re-crawl cycle: 7 days
  • Index size (compressed): ~500 GB
  • Search latency: < 200 ms

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
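One way to make the per-hop latency budgets concrete is to write them down and check they sum under the SLO. The split below is an illustrative assumption, not a measured allocation:

```python
# Hypothetical per-hop budget for the 200 ms search SLO.
# The individual numbers are assumptions for illustration only.
BUDGET_MS = 200
hops = {
    "load balancer + gateway":   10,
    "query parsing":             10,
    "index lookup (postings)":   80,
    "ranking (TF-IDF, top-10)":  50,
    "snippet generation":        30,
}
used = sum(hops.values())
assert used <= BUDGET_MS, "per-hop budgets exceed the SLO"
print(f"allocated {used} ms of {BUDGET_MS} ms, {BUDGET_MS - used} ms slack")
```

Keeping explicit slack (here 20 ms) lets you defend the p95 number in review: any hop that overruns its line item is immediately visible.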

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
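The idempotency-key point can be sketched in a few lines. This is a minimal in-memory illustration with hypothetical names; a real service would back the key store with a conditional write in the database:

```python
# Minimal sketch of an idempotent write endpoint (names are hypothetical).
_seen: dict[str, dict] = {}  # idempotency_key -> stored response

def submit_crawl_request(idempotency_key: str, url: str) -> dict:
    """Replay the stored response instead of enqueuing the URL twice."""
    if idempotency_key in _seen:
        return _seen[idempotency_key]
    response = {"accepted": True, "url": url}  # side effect would go here
    _seen[idempotency_key] = response
    return response

first = submit_crawl_request("req-1", "https://docs.example.com/a")
replay = submit_crawl_request("req-1", "https://docs.example.com/a")
assert first is replay  # duplicate submission returns the same result
```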

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and define an explicit backup/restore strategy.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Enforce lifecycle policies, retention tiers, and checksum validation.
  • Apply strict input validation and backward-compatible versioning.
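The idempotent-consumer guarantee above amounts to deduplicating on a message ID before applying the side effect. A minimal sketch, assuming each message carries a correlation ID:

```python
# Sketch of an idempotent queue consumer keyed by correlation ID
# (the message schema here is an assumption for illustration).
processed: set[str] = set()
index: dict[str, str] = {}  # doc_id -> extracted text

def handle_message(msg: dict) -> bool:
    """Apply the message exactly once; redeliveries become no-ops."""
    msg_id = msg["correlation_id"]
    if msg_id in processed:
        return False                      # duplicate delivery, skip
    index[msg["doc_id"]] = msg["text"]    # the actual side effect
    processed.add(msg_id)
    return True

msg = {"correlation_id": "c-42", "doc_id": "d1", "text": "hello docs"}
assert handle_message(msg) is True
assert handle_message(msg) is False       # redelivery is safely ignored
```

In production the `processed` set lives in durable storage and is updated in the same transaction as the side effect, so a crash between the two cannot cause double-application.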

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
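The p95 in the last bullet is simple to compute from raw latency samples; the nearest-rank method is a common convention:

```python
import math

def p95(samples_ms: list[float]) -> float:
    """p95 latency via the nearest-rank method."""
    ordered = sorted(samples_ms)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return ordered[idx]

latencies = list(range(1, 101))  # synthetic samples: 1..100 ms
print(p95(latencies))            # 95th value of 100 sorted samples
```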

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.

Practical Notes

  • The URL frontier is essentially a distributed priority queue - consider using a message broker with priority support.
  • Bloom filters can efficiently detect already-crawled URLs without storing the full URL set in memory.
  • An inverted index can be stored as sorted posting lists - postings for a term are the list of doc IDs containing it.
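The posting-list idea in the last bullet pairs naturally with the TF-IDF ranking from the problem statement. A toy in-memory sketch (the corpus and scoring are illustrative; real indexes compress postings and precompute norms):

```python
import math
from collections import defaultdict

# Toy inverted index: term -> sorted posting list of (doc_id, term_frequency).
docs = {
    1: "rust async runtime tokio",
    2: "python async io tutorial",
    3: "rust ownership and borrowing",
}
postings: dict[str, list[tuple[int, int]]] = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    counts: dict[str, int] = defaultdict(int)
    for term in text.split():
        counts[term] += 1
    for term, tf in counts.items():
        postings[term].append((doc_id, tf))  # doc IDs stay sorted

def tf_idf_search(query: str, top_k: int = 10) -> list[int]:
    """Score docs by summed TF-IDF over query terms; return top doc IDs."""
    scores: dict[int, float] = defaultdict(float)
    n = len(docs)
    for term in query.split():
        plist = postings.get(term, [])
        if not plist:
            continue
        idf = math.log(n / len(plist))  # rarer terms weigh more
        for doc_id, tf in plist:
            scores[doc_id] += tf * idf
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [doc_id for doc_id, _ in ranked][:top_k]

print(tf_idf_search("rust async"))
```

Because posting lists are sorted by doc ID, multi-term queries can intersect or merge them with a linear scan, which is what keeps lookups fast at 10 M-document scale.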


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary SQL DB -> Message Queue -> Background Workers -> Object Storage

Design strengths

  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.