
Real-Time Analytics Dashboard

Analytics · Databases · Message Queues · Caching · WebSockets

Problem Statement

MetricFlow is building a real-time analytics platform (like a simplified Mixpanel/Amplitude). Customers embed a JavaScript SDK that sends user interaction events. The system then:

- Event ingestion - receive clickstream events (page view, button click, form submit, custom events) at 100,000 events per second at peak.
- Real-time aggregation - compute metrics in real time: active users right now, events per minute, top pages, funnel conversion rates.
- Dashboards - a web dashboard showing live-updating charts and counters. Data freshness: < 5 seconds from event to dashboard.
- Historical queries - ad hoc queries over the last 90 days of data: "Show me daily active users by country for the last month."
- Segmentation - filter events by user properties (plan, country, device) and event properties.
- Funnel analysis - define multi-step funnels (e.g., Landing → Sign Up → First Purchase) and compute conversion rates.
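The funnel-analysis requirement can be sketched as a computation over time-ordered per-user event sequences. This is a minimal in-memory sketch with hypothetical event names; in production this query would run in the stream processor or columnar store, not application code:

```python
from collections import defaultdict

def funnel_conversion(events, steps):
    """events: list of (user_id, event_name, epoch_seconds).
    steps: ordered funnel step names.
    Returns per-step counts of users who reached each step in order."""
    by_user = defaultdict(list)
    for user, name, ts in sorted(events, key=lambda e: e[2]):
        by_user[user].append(name)
    counts = [0] * len(steps)
    for seq in by_user.values():
        i = 0  # next funnel step this user must hit
        for name in seq:
            if i < len(steps) and name == steps[i]:
                counts[i] += 1
                i += 1
    return counts

events = [
    ("u1", "landing", 1), ("u1", "signup", 2), ("u1", "purchase", 3),
    ("u2", "landing", 1), ("u2", "signup", 5),
    ("u3", "landing", 2),
]
counts = funnel_conversion(events, ["landing", "signup", "purchase"])
# counts == [3, 2, 1]: signup conversion 2/3, purchase conversion 1/2
```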

Handle 100,000 events/second with 90 days retention (~500 billion events total).
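A quick back-of-envelope check on those numbers (the per-event byte size is an assumed figure, not given in the problem):

```python
PEAK_EPS = 100_000                      # peak events/second (given)
RETENTION_DAYS = 90                     # retention window (given)
TOTAL_EVENTS = 500_000_000_000          # ~500 billion events (given)

seconds = RETENTION_DAYS * 86_400       # 7,776,000 s in the window
avg_eps = TOTAL_EVENTS / seconds        # ≈ 64,300 events/s sustained average
peak_to_avg = PEAK_EPS / avg_eps        # ≈ 1.56x peak-to-average ratio

# Assumption: ~200 bytes per stored event row (illustrative only)
BYTES_PER_EVENT = 200
raw_tb = TOTAL_EVENTS * BYTES_PER_EVENT / 1e12   # 100 TB before replication
```

So the 500 billion figure implies an average rate around 64k events/s, i.e. the system runs at roughly two-thirds of peak on average, and raw storage lands in the low-hundreds-of-TB range once replication and indexes are added.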

What You'll Learn

Design a real-time analytics pipeline that ingests clickstream events and powers live dashboards with < 5 s latency. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Peak events/second: ~100,000
Data freshness: < 5 seconds
Retention period: 90 days
Total events stored: ~500,000,000,000
Dashboard query latency: < 2 seconds
Concurrent dashboards: ~5,000
Availability target: 99.9%
Approach

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a real-time analytics pipeline that ingests clickstream events and powers live dashboards with < 5 s latency.
  • Design for a peak ingest load of ~100,000 events/second, with burst headroom above that.
  • Peak events/second: ~100,000
  • Data freshness: < 5 seconds
  • Retention period: 90 days
  • Total events stored: ~500,000,000,000
  • Dashboard query latency: < 2 seconds

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
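One way to make the per-hop latency budget concrete: allocate the < 5 s freshness SLO across the pipeline stages and verify the sum leaves headroom. The stage names and millisecond figures below are illustrative assumptions, not prescribed values:

```python
# Hypothetical per-hop p95 budgets (ms) for the <5 s event-to-dashboard SLO
budget_ms = {
    "sdk_batch_flush": 1000,           # client-side batching interval
    "ingest_gateway": 100,             # validate + write to the log
    "kafka_produce_consume": 400,      # broker replication + consumer lag
    "stream_aggregation_window": 2000, # windowed aggregation latency
    "redis_write_read": 50,            # counter update + dashboard read
    "websocket_push": 200,             # fanout to connected dashboards
}

total = sum(budget_ms.values())        # 3,750 ms
headroom = 5000 - total                # 1,250 ms of slack for tail latency
assert total <= 5000
```

Defending p95 in review then reduces to showing each stage stays within its line item.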

3) Architecture Decisions

  • Analytics: Maintain separate OLTP and analytics paths; stream events into a warehouse/time-series layer.
  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Message Queues: Move non-blocking and retry-heavy work to async consumers with explicit retry and DLQ policies.
  • Caching: Put cache on hot read paths first and pick cache-aside or write-through explicitly.
  • WebSockets: Use persistent connection gateways and decouple fanout via pub/sub or queues.
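The cache-aside choice named above can be sketched in a few lines. This uses in-memory dicts as stand-ins for Redis and the database; the class and TTL value are illustrative, not a prescribed API:

```python
import time

class CacheAside:
    """Minimal cache-aside read path with TTL (dict stand-ins for Redis/DB)."""

    def __init__(self, db, ttl_s=5.0):
        self.db = db             # stand-in for the system of record
        self.cache = {}          # key -> (value, expires_at)
        self.ttl_s = ttl_s

    def get(self, key):
        hit = self.cache.get(key)
        if hit and hit[1] > time.monotonic():
            return hit[0]                      # cache hit
        value = self.db[key]                   # miss: read through to the DB
        self.cache[key] = (value, time.monotonic() + self.ttl_s)
        return value

    def invalidate(self, key):
        self.cache.pop(key, None)              # explicit invalidation hook
```

The application owns both population and invalidation, which is exactly the "pick cache-aside or write-through explicitly" decision: write-through would instead push every write into the cache at update time.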

4) Reliability and Failure Strategy

  • Version event schemas and monitor drop/late-event rates.
  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Guarantee idempotent consumers and trace every message with correlation IDs.
  • Bound staleness with TTL + invalidation hooks for critical entities.
  • Track connection churn, backpressure, and session resumption behavior.
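The "idempotent consumers" bullet can be sketched as dedup by event ID, so at-least-once delivery never double-counts. This is a minimal in-memory sketch; a production version would bound the seen-set (e.g. a Redis set with TTL) and the field name `event_id` is an assumption:

```python
class IdempotentConsumer:
    """Skips redelivered messages by remembering processed event IDs."""

    def __init__(self):
        self.seen = set()   # processed event IDs (bounded with TTL in prod)
        self.counter = 0    # stand-in for the real side effect

    def handle(self, event):
        if event["event_id"] in self.seen:
            return False                 # duplicate delivery: no-op
        self.seen.add(event["event_id"])
        self.counter += 1                # apply the side effect exactly once
        return True
```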

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
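Computing the p95 tracked above from load-test samples is simple with the nearest-rank method; a minimal sketch:

```python
import math

def p95(samples):
    """Nearest-rank p95: the value at rank ceil(0.95 * n) in sorted order."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

latencies_ms = list(range(1, 101))  # e.g. 1..100 ms from a load test
assert p95(latencies_ms) == 95
```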

6) Trade-offs to Call Out in Interviews

  • Analytics: Analytics pipeline unlocks insights, but adds eventual consistency and governance overhead.
  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Message Queues: Async pipelines absorb spikes well, but increase eventual-consistency complexity.
  • Caching: Higher hit rate cuts latency/cost, but stale data and invalidation bugs become primary risks.
  • WebSockets: WebSockets reduce interaction latency but complicate scaling and state management.

Practical Notes

  • Two paths: a real-time path (Kafka → Flink/stream processor → pre-computed counters in Redis) for live dashboards, and a batch path (Kafka → data lake → columnar store) for historical queries.
  • Use a columnar database (ClickHouse, Druid, or Apache Pinot) for fast analytical queries over billions of rows.
  • Pre-aggregate common metrics (active users, event counts) in materialized views to avoid scanning raw events on every dashboard load.
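The "events per minute" pre-aggregation from the real-time path boils down to tumbling 60-second windows. This is a minimal in-memory sketch of what a Flink job would compute before materializing counters into Redis:

```python
from collections import Counter

def tumbling_minute_counts(events):
    """events: iterable of (event_name, epoch_seconds).
    Buckets counts into 60 s tumbling windows keyed by window start."""
    counts = Counter()
    for name, ts in events:
        window_start = ts - (ts % 60)    # floor to the minute boundary
        counts[(window_start, name)] += 1
    return counts

counts = tumbling_minute_counts([
    ("page_view", 0), ("page_view", 30), ("click", 59), ("page_view", 60),
])
# counts[(0, "page_view")] == 2; counts[(0, "click")] == 1
# counts[(60, "page_view")] == 1
```

A real stream processor additionally handles late events via watermarks and emits each closed window downstream instead of holding everything in memory.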

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> Load Balancer -> API Gateway -> API Service -> Primary NoSQL DB -> Redis Cache -> Event Bus -> Background Workers

Design strengths

  • Cache sits on the read path to absorb repeated queries and keep DB pressure stable.
  • Async queue/event bus isolates bursty workloads and supports retries without blocking synchronous requests.
  • Analytics pipeline is separated from OLTP path to avoid reporting workloads impacting transactions.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.