Difficulty: Medium · Intermediate

Log Aggregation System

Topics: Databases · Message Queues · Storage · Search · Monitoring

Problem Statement

LogStream is building a centralized log management platform where engineering teams send application logs for search, alerting, and analysis. Features:

- Log ingestion - accept logs via HTTP API, syslog, and agents installed on servers. Each log entry has a timestamp, severity, message, and structured metadata (service name, host, trace ID).
- Full-text search - search across all logs by keyword, severity, service, and time range. Results in < 3 seconds even when scanning billions of entries.
- Live tail - stream new log entries in real time (like `tail -f`), filtered by service or keyword.
- Alerting - define alert rules (e.g., "alert if error count > 100 in 5 minutes for service=payments"). Notify via Slack, email, or PagerDuty.
- Retention & archival - hot storage for 7 days (fast search), warm storage for 30 days (slower search), cold archive for 1 year (restore-on-demand).
- Log patterns - automatically detect and group similar log messages into patterns (e.g., "Connection timeout from [IP]" appears 50,000 times today).
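The alerting rule quoted above ("error count > 100 in 5 minutes") can be evaluated with a sliding window over matching log entries. A minimal sketch, assuming per-rule in-memory state (class and method names are illustrative, not from any specific product):

```python
from collections import deque


class SlidingWindowAlert:
    """Fires when more than `threshold` matching entries arrive within `window_s` seconds."""

    def __init__(self, threshold: int, window_s: float):
        self.threshold = threshold
        self.window_s = window_s
        self.events: deque[float] = deque()  # timestamps of matching entries

    def record(self, ts: float) -> bool:
        """Record one matching entry; return True if the rule now fires."""
        self.events.append(ts)
        # Drop entries that have aged out of the window.
        while self.events and self.events[0] <= ts - self.window_s:
            self.events.popleft()
        return len(self.events) > self.threshold
```

In production, this per-rule state would typically live in the alerting service's stream processor rather than in a single process, but the windowing logic is the same.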

Ingest 1 TB of logs per day from 200 services across 5,000 servers.
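The stated volume implies useful back-of-envelope numbers: average entry size, average ingest rate, and hot-tier storage. A quick check (decimal TB assumed; replication and index overhead excluded):

```python
# Back-of-envelope check of the stated numbers: 1 TB/day, 5 billion entries/day.
SECONDS_PER_DAY = 86_400

bytes_per_day = 1 * 10**12        # ~1 TB/day (decimal TB assumed)
entries_per_day = 5 * 10**9       # ~5,000,000,000 entries/day

avg_entry_bytes = bytes_per_day / entries_per_day        # ≈ 200 bytes/entry
avg_entries_per_sec = entries_per_day / SECONDS_PER_DAY  # ≈ 57,870 entries/s
hot_storage_bytes = bytes_per_day * 7                    # 7-day hot tier ≈ 7 TB raw
```

An average of roughly 58K entries/s also explains the ~80,000 RPS peak target used later: it leaves about 1.4x burst headroom over the mean.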

What You'll Learn

Design a centralized logging system (like Datadog Logs) that ingests, indexes, and searches 1 TB of logs per day. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Log volume/day: ~1 TB
Log entries/day: ~5,000,000,000
Services: ~200
Search latency (7-day range): < 3 seconds
Live tail latency: < 2 seconds
Hot retention: 7 days
Availability target: 99.9%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a centralized logging system (like Datadog Logs) that ingests, indexes, and searches 1 TB of logs per day.
  • Design for a peak load target around 80,000 RPS (including burst headroom).
  • Log volume/day: ~1 TB
  • Log entries/day: ~5,000,000,000
  • Services: ~200
  • Search latency (7-day range): < 3 seconds
  • Live tail latency: < 2 seconds

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
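One way to make the per-hop latency budget concrete for the 3-second search SLO. All values below are assumptions for illustration, not measured numbers; the point is that the budget must sum to the SLO, with the index scan taking the dominant share:

```python
# Illustrative per-hop p95 budget for the < 3 s search SLO (values assumed).
search_budget_ms = {
    "load_balancer": 5,
    "api_service": 45,
    "query_fanout_to_shards": 2400,   # dominant cost: scanning hot indices
    "merge_and_rank": 400,
    "serialization_and_network": 150,
}

# The budget must fit inside the SLO, or a hop has to give time back.
assert sum(search_budget_ms.values()) <= 3000
```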

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Message Queues: Move slow and retry-heavy work off the request path to async consumers with explicit retry and DLQ policies.
  • Storage: Use object storage for large blobs and keep metadata/authorization separate in the API tier.
  • Search: Use primary store for writes and async index updates for search relevance + scale.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Enforce lifecycle policies, retention tiers, and checksum validation.
  • Track indexing lag and support reindex from source of truth.
  • Alert on user-impact SLOs, not only infrastructure metrics.
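The idempotent-consumer guarantee above can be sketched simply: each message carries a unique ID (e.g., a correlation ID), and processing is skipped on replay. A minimal sketch; in production the seen-set would live in a durable store (e.g., Redis, or the database itself via a unique constraint), and field names here are illustrative:

```python
# In-memory stand-ins for durable dedupe state and the real side effect.
processed_ids: set[str] = set()
results: list[str] = []


def handle(message: dict) -> None:
    """Process a queue message exactly once, even under redelivery."""
    msg_id = message["id"]
    if msg_id in processed_ids:
        return  # duplicate delivery: safe no-op
    results.append(message["payload"])  # stand-in for indexing/writing
    processed_ids.add(msg_id)


# A retried delivery of the same message leaves exactly one result.
for msg in [{"id": "a1", "payload": "index-entry"},
            {"id": "a1", "payload": "index-entry"}]:
    handle(msg)
```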

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Storage: Object storage is cheap and durable, but random low-latency reads are weaker than databases/caches.
  • Search: Search index gives rich querying but introduces eventual consistency and index ops overhead.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.

Practical Notes

  • Elasticsearch for hot storage (7 days), with index-per-day rotation. Delete old indices automatically.
  • Ingest via Kafka - decouple producers from the indexing pipeline. Kafka also acts as a buffer during traffic spikes.
  • Live tail: subscribe to the Kafka topic with filters, push matching entries via WebSocket to the browser.

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Service -> Primary SQL DB -> Message Queue -> Background Workers -> Object Storage -> Search Index

Design strengths

  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.
  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.