Difficulty: Medium (Intermediate)

Managed DNS Service

Topics: Databases, Geo Distribution, Caching, Monitoring, API Design

Problem Statement

NameServe is building a managed DNS service (like Route 53 / Cloudflare DNS). Customers delegate their domains to NameServe's nameservers for authoritative DNS hosting. Features:

- Zone management - API and dashboard for managing DNS records (A, AAAA, CNAME, MX, TXT, SRV, etc.) for customer domains.
- Low-latency resolution - DNS queries must be answered in < 10 ms from the nearest PoP. Deploy authoritative DNS servers at 30+ global locations.
- Health-check routing - check backend endpoints every 30 seconds. If a backend is unhealthy, automatically remove its IP from DNS responses (failover routing).
- Geo-based routing - return different IPs based on the requester's geographic location (e.g., US users → US servers, EU users → EU servers).
- Weighted routing - distribute traffic across backends by weight (e.g., 70% to primary, 30% to canary).
- Fast propagation - when a customer updates a DNS record, the change must reach all nameservers within 30 seconds.
- DNSSEC - sign zones with DNSSEC to prevent DNS spoofing.

Handle 1 billion DNS queries per day across 100,000 hosted zones.
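The weighted-routing feature above boils down to proportional random selection over a backend pool. A minimal sketch, assuming illustrative IPs and the 70/30 primary/canary split from the feature list:

```python
import random

def pick_weighted(backends):
    """Pick one backend IP proportionally to its weight.

    `backends` is a list of (ip, weight) pairs; weights need not sum to 100.
    """
    total = sum(w for _, w in backends)
    roll = random.uniform(0, total)
    cumulative = 0.0
    for ip, weight in backends:
        cumulative += weight
        if roll <= cumulative:
            return ip
    return backends[-1][0]  # guard against float rounding at the boundary

# Illustrative 70/30 primary/canary pool (IPs are documentation addresses)
pool = [("203.0.113.10", 70), ("203.0.113.20", 30)]
```

In production the same selection would run per query at the PoP, over the subset of backends that passed health checks.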

What You'll Learn

Design a managed DNS service with zone hosting, health-check routing, geo-based routing, and fast propagation. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

DNS queries/day: ~1,000,000,000
Hosted zones: ~100,000
Global PoPs: 30+
Query response time: < 10 ms
Record propagation: < 30 seconds
Health check interval: every 30 seconds
Availability target: 100% (DNS is critical infrastructure)

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a managed DNS service with zone hosting, health-check routing, geo-based routing, and fast propagation.
  • Design for a peak load target around 57,870 RPS (5x the ~11,574 RPS daily average, to cover burst headroom).
  • DNS queries/day: ~1,000,000,000
  • Hosted zones: ~100,000
  • Global PoPs: 30+
  • Query response time: < 10 ms
  • Record propagation: < 30 seconds

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
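The capacity-planning method above can be made concrete with back-of-envelope arithmetic; the 5x burst factor and per-record size are assumptions, not measured values:

```python
QUERIES_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 86_400

avg_rps = QUERIES_PER_DAY / SECONDS_PER_DAY   # ~11,574 RPS average
peak_rps = avg_rps * 5                        # ~57,870 RPS with assumed 5x burst headroom

ZONES = 100_000
RECORDS_PER_ZONE = 20                         # assumed average from the problem statement
BYTES_PER_RECORD = 1_024                      # assumed, deliberately generous

# ~2 GB total: the full zone dataset fits in RAM at every PoP
zone_dataset_bytes = ZONES * RECORDS_PER_ZONE * BYTES_PER_RECORD

# If anycast spread load evenly across 30 PoPs (it will not, exactly)
per_pop_peak_rps = peak_rps / 30              # ~1,929 RPS per PoP
```

Numbers like these are what you defend in review: each tier's budget should exceed its share of peak by the stated 2-3x margin.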

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • Geo Distribution: Route users to nearest region/edge while keeping write-consistency boundaries explicit.
  • Caching: Put cache on hot read paths first and pick cache-aside or write-through explicitly.
  • Monitoring: Instrument golden signals (latency, traffic, errors, saturation) per tier and per tenant/domain.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
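One way to sketch the API-design points above (validation, error contracts, idempotency keys) for the record-upsert path. The field names, TTL bounds, and in-memory stores are illustrative assumptions, not a prescribed schema:

```python
from dataclasses import dataclass, field

VALID_TYPES = {"A", "AAAA", "CNAME", "MX", "TXT", "SRV"}

@dataclass
class RecordSet:
    name: str                 # e.g. "www.example.com."
    rtype: str                # one of VALID_TYPES
    ttl: int                  # seconds
    values: list = field(default_factory=list)

    def validate(self):
        """Return a list of error strings; empty means valid."""
        errors = []
        if self.rtype not in VALID_TYPES:
            errors.append(f"unsupported type {self.rtype!r}")
        if not 0 < self.ttl <= 86_400:
            errors.append("ttl must be in (0, 86400]")
        if not self.values:
            errors.append("at least one value required")
        return errors

# Idempotency: replaying the same client-supplied key returns the stored
# result instead of re-applying the write (toy in-memory version).
_results_by_key: dict = {}

def upsert_record(zone: dict, rs: RecordSet, idempotency_key: str):
    if idempotency_key in _results_by_key:
        return _results_by_key[idempotency_key]
    errors = rs.validate()
    result = {"ok": not errors, "errors": errors}
    if not errors:
        zone[(rs.name, rs.rtype)] = rs
    _results_by_key[idempotency_key] = result
    return result
```

Returning a structured `errors` list (rather than a bare status code) is the "error contract" half: clients get the same shape on every failure.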

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and explicit backup/restore strategy.
  • Design region failover and data residency controls as first-class requirements.
  • Bound staleness with TTL + invalidation hooks for critical entities.
  • Alert on user-impact SLOs, not only infrastructure metrics.
  • Apply strict input validation and backward-compatible versioning.
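The "strong write constraints" bullet above is commonly realized as compare-and-set on a per-zone version, analogous to bumping the SOA serial. A minimal sketch with an in-memory store standing in for the primary DB (class and method names are assumptions):

```python
class ConflictError(Exception):
    """Raised when the caller's expected serial is stale."""

class ZoneStore:
    """Toy system-of-record with optimistic concurrency per zone."""

    def __init__(self):
        self._zones = {}  # zone name -> (serial, records)

    def read(self, zone):
        return self._zones.get(zone, (0, {}))

    def conditional_write(self, zone, expected_serial, records):
        serial, _ = self.read(zone)
        if serial != expected_serial:
            # A concurrent writer won; caller must re-read and retry
            raise ConflictError(f"expected serial {expected_serial}, found {serial}")
        self._zones[zone] = (serial + 1, records)
        return serial + 1
```

The new serial doubles as the propagation marker: PoPs can compare it against their in-memory copy to detect whether an update has landed.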

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • Geo Distribution: Global latency improves, but cross-region consistency and operations become harder.
  • Caching: Higher hit rate cuts latency/cost, but stale data and invalidation bugs become primary risks.
  • Monitoring: Deep observability speeds incident response but raises ingestion and tooling costs.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.

Practical Notes

  • DNS servers are read-heavy - load all zone data into memory for O(1) lookups. At 100k zones with ~20 records each (~2 million records, low single-digit GB), the dataset comfortably fits in RAM at every PoP.
  • Propagation: on record change, publish an event to a message bus. All PoPs subscribe and update their in-memory zone data.
  • Anycast routing: announce the same IP prefix from all PoPs. BGP naturally routes queries to the nearest PoP.
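The practical notes above combine into one sketch of a PoP's resolution path: O(1) in-memory lookup, then geo selection, then health filtering. The zone data, region names, and fail-open policy are illustrative assumptions:

```python
# In-memory zone table, kept fresh by the message-bus subscriber:
# (name, rtype) -> {region: [(ip, healthy)]}
ZONES = {
    ("app.example.com.", "A"): {
        "US": [("198.51.100.1", True), ("198.51.100.2", False)],
        "EU": [("203.0.113.1", True)],
    }
}

def resolve(name, rtype, client_region):
    entry = ZONES.get((name, rtype))
    if entry is None:
        return []  # NXDOMAIN / NODATA handled by the caller
    # Geo routing: prefer the client's region, else fall back to every region
    pool = entry.get(client_region) or [pair for pairs in entry.values() for pair in pairs]
    # Health-check routing: drop unhealthy backends; if all are down,
    # fail open and return everything rather than an empty answer
    healthy = [ip for ip, ok in pool if ok]
    return healthy or [ip for ip, _ in pool]
```

Note the fail-open choice at the end: serving a possibly unhealthy IP is usually less damaging than returning an empty answer, which clients would cache as NODATA for the negative TTL.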

Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> DNS -> Load Balancer -> API Gateway -> API Service -> Primary SQL DB -> Read Model DB -> Redis Cache

Design strengths

  • Cache sits on the read path to absorb repeated queries and keep DB pressure stable.
  • Monitoring and logs are wired in from day one for rapid incident triage.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.