EasyStarter

Status Page Service

Databases · API Design · Monitoring

Problem Statement

UptimeBoard is building a hosted status page service for SaaS companies. Each customer gets a public page (e.g., `status.example.com`) showing:

- Component status - list of services (API, Dashboard, Database) with status indicators (operational, degraded, outage).
- Uptime history - a 90-day uptime bar graph per component showing daily/hourly availability.
- Incidents - admins create incident reports with updates ("Investigating" → "Identified" → "Monitoring" → "Resolved"). Subscribers get email/SMS notifications.
- Scheduled maintenance - announce upcoming maintenance windows.
- Health checks - automatic HTTP/TCP/ping checks every 60 seconds. Auto-create incidents when a check fails 3 times in a row.

The service targets ~2,000 customers, each with a public status page.
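The "3 consecutive failures opens an incident" rule above can be sketched as a small per-check state tracker. This is a minimal sketch, not a real API: the state layout and action strings are assumptions.

```python
# Sketch of the "create an incident after 3 consecutive failures" rule.
# CheckState and the returned action strings are illustrative assumptions.

FAILURE_THRESHOLD = 3

class CheckState:
    def __init__(self):
        self.consecutive_failures = 0
        self.incident_open = False

def record_result(state: CheckState, ok: bool):
    """Update state after one health-check run; return an action or None."""
    if ok:
        state.consecutive_failures = 0
        if state.incident_open:
            state.incident_open = False
            return "resolve_incident"
        return None
    state.consecutive_failures += 1
    if state.consecutive_failures >= FAILURE_THRESHOLD and not state.incident_open:
        state.incident_open = True
        return "create_incident"
    return None
```

Resetting the counter on any success is what makes the rule "3 in a row" rather than "3 total", which keeps one flaky probe from paging anyone.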

What You'll Learn

Design a status page service (like Statuspage.io) showing uptime, incident updates, and health checks. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

| Constraint | Target |
| --- | --- |
| Customer status pages | ~2,000 |
| Health checks/minute | ~20,000 |
| Status page load time | < 500 ms |
| Incident notification delay | < 2 minutes |
| Uptime data retention | 1 year |
| Availability target | 99.9% (must be higher than customers' own uptime!) |

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a status page service (like Statuspage.io) showing uptime, incident updates, and health checks.
  • Design for a peak load target around 500 RPS (including burst headroom).
  • Customer status pages: ~2,000
  • Health checks/minute: ~20,000
  • Status page load time: < 500 ms
  • Incident notification delay: < 2 minutes
  • Uptime data retention: 1 year

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
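To make the idempotency-key bullet concrete, here is a minimal sketch of key handling in front of a "create incident" write. The in-memory dict stands in for a durable idempotency table; all names are illustrative.

```python
# Sketch of idempotency-key handling for a "create incident" write.
# An in-memory dict stands in for a durable idempotency table.

_idempotency_store: dict = {}

def create_incident(idempotency_key: str, payload: dict) -> dict:
    """Return the cached response if this key was already processed."""
    if idempotency_key in _idempotency_store:
        return _idempotency_store[idempotency_key]   # replay: no new write
    incident = {"id": len(_idempotency_store) + 1, **payload}
    _idempotency_store[idempotency_key] = incident   # record before acking
    return incident
```

A retried request with the same key gets the original response back instead of creating a duplicate incident, which is what makes client retries safe.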

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Apply strict input validation and backward-compatible versioning.
  • Alert on user-impact SLOs, not only infrastructure metrics.
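The "strong write constraints" bullet can be illustrated with a conditional write: advance an incident's status only from the expected current state, so a concurrent or retried update cannot regress it. A minimal sketch using the standard-library sqlite3 module; the schema is an assumption.

```python
import sqlite3

# Conditional write: the UPDATE only succeeds if the row is still in the
# expected state, so concurrent/retried updates cannot regress a status.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO incidents VALUES (1, 'investigating')")

def advance_status(incident_id: int, expected: str, new: str) -> bool:
    cur = conn.execute(
        "UPDATE incidents SET status = ? WHERE id = ? AND status = ?",
        (new, incident_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1  # False: someone else changed the row first
```

The same compare-and-set shape works as a conditional write in most NoSQL stores, so the technique survives the SQL-vs-NoSQL trade-off discussed below.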

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
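Tracking the p95 latency SLO mentioned above can be as simple as a nearest-rank percentile over sampled request latencies; a minimal sketch:

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile over a list of request latencies."""
    ordered = sorted(latencies_ms)
    # nearest-rank: ceil(0.95 * n), computed with integer math
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

In practice a monitoring stack computes this from histograms rather than raw samples, but the number it must defend in review is the same.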

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.

Practical Notes

  • The status page itself must be ultra-reliable - consider serving it from a CDN as static HTML updated every minute.
  • Health check workers should run from multiple geographic locations to avoid false positives from network issues.
  • Store uptime data as 1-minute resolution buckets - aggregate into hourly/daily summaries for the history view.
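The 1-minute-bucket note above can be sketched as a simple roll-up into hourly availability; the bucket layout `(unix_minute, ok_checks, total_checks)` is an assumed shape, not a prescribed schema.

```python
from collections import defaultdict

# Roll 1-minute uptime buckets into hourly availability percentages.
# Each bucket is (unix_minute, ok_checks, total_checks) - an assumed layout.

def hourly_availability(minute_buckets):
    hours = defaultdict(lambda: [0, 0])          # hour -> [ok, total]
    for minute, ok, total in minute_buckets:
        hour = minute // 60
        hours[hour][0] += ok
        hours[hour][1] += total
    return {h: 100.0 * ok / total for h, (ok, total) in hours.items() if total}
```

Running the same roll-up again from hours into days gives the 90-day bar graph without ever re-reading raw minute data.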

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> API Gateway -> API Service -> Primary SQL DB -> Monitoring -> Log Aggregator

Design strengths

  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.