
Feature Flag Service

Databases · API Design · Caching

Problem Statement

FlagSwitch is building a feature flag service (like LaunchDarkly lite). Features:

- Boolean flags - simple on/off toggles for features. SDKs (JS, Python, Go) check flag state in < 10 ms.
- Percentage rollouts - roll out a feature to 5% → 25% → 50% → 100% of users gradually.
- User targeting - enable a flag for specific user segments (e.g., "enterprise customers" or "internal employees").
- A/B testing - split traffic between variants (A/B/C) and track conversion metrics.
- Audit log - record who changed which flag, when, with the ability to roll back.
- SDK caching - SDKs cache flag values locally and poll for updates every 30 seconds.

Targeting 500 applications with 100,000 flag evaluations per second across all clients.
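The evaluation model above can be sketched in a few lines: the SDK holds a cached rule set in memory and never touches the network per evaluation. The `FlagStore` class and the rule shape below are illustrative assumptions, not a real SDK API:

```python
# Hypothetical in-memory flag store: a background poll refreshes `rules`
# from the server; evaluation itself is a pure in-memory lookup.
class FlagStore:
    def __init__(self):
        self.rules = {}  # flag_name -> rule dict, replaced wholesale on poll

    def update(self, rules):
        self.rules = rules  # single reference swap, safe for concurrent readers

    def is_enabled(self, flag_name, user):
        rule = self.rules.get(flag_name)
        if rule is None:
            return False  # unknown flag: fail closed
        if user.get("segment") in rule.get("segments", []):
            return True   # explicit segment targeting wins
        return rule.get("enabled", False)

store = FlagStore()
store.update({"new-checkout": {"enabled": False, "segments": ["internal"]}})
print(store.is_enabled("new-checkout", {"segment": "internal"}))  # True
print(store.is_enabled("new-checkout", {"segment": "free"}))      # False
```

Because every evaluation is a dictionary lookup, the < 10 ms SDK budget is trivially met; the hard part is keeping the cached rules fresh, which the polling loop handles.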

What You'll Learn

Design a feature flag service that lets teams toggle features, run A/B tests, and do gradual rollouts. Build this architecture under realistic production constraints, then validate tradeoffs in the design lab simulation.


Constraints

Applications / projects: ~500
Total flags: ~10,000
Flag evaluations/second: ~100,000
SDK evaluation time: < 10 ms (local cache)
Config propagation delay: < 60 seconds
Availability target: 99.99% (SDKs degrade gracefully)

Interview-Ready Approach

1) Clarify Scope and SLOs

  • Problem statement: Design a feature flag service that lets teams toggle features, run A/B tests, and do gradual rollouts.
  • Design the server for a modest peak load (on the order of 500 RPS with burst headroom): the 100,000 evaluations/second happen in-memory inside SDKs, so the control plane mainly serves config polls and flag-change writes.
  • Applications / projects: ~500
  • Total flags: ~10,000
  • Flag evaluations/second: ~100,000
  • SDK evaluation time: < 10 ms (local cache)
  • Config propagation delay: < 60 seconds

2) Capacity Planning Method

  • Convert traffic and growth constraints into request rate, storage growth, and concurrency budgets.
  • Keep at least 2-3x safety margin per tier (ingress, compute, storage, async workers).
  • Reserve explicit latency budgets per hop so p95 can be defended in review.
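As a worked example of this method applied to the constraints above (the per-flag rule size and SDK instance count are assumptions chosen for illustration, not given in the problem):

```python
# Back-of-envelope capacity math for the control plane. Evaluations are
# local to SDKs, so server load is dominated by config polling.
apps = 500
flags_total = 10_000
poll_interval_s = 30
sdk_instances_per_app = 20      # assumption: ~20 deployed instances per app
flag_rule_bytes = 1_000         # assumption: ~1 KB of serialized rules per flag

poll_rps = apps * sdk_instances_per_app / poll_interval_s
with_headroom = poll_rps * 3    # 3x safety margin per the method above
config_size_mb = flags_total * flag_rule_bytes / 1e6

print(f"polling load: ~{poll_rps:.0f} RPS, ~{with_headroom:.0f} RPS with 3x headroom")
print(f"full rule set: ~{config_size_mb:.0f} MB")
```

Under these assumptions the polling load lands in the low hundreds of RPS, and the entire rule set is around 10 MB, small enough to hold in Redis, or even fully in each SDK's memory.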

3) Architecture Decisions

  • Databases: Define a clear system-of-record and design read/write paths separately before adding optimizations.
  • API Design: Standardize API boundaries, idempotency keys, pagination, and error contracts first.
  • Caching: Put cache on hot read paths first and pick cache-aside or write-through explicitly.
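A cache-aside read path for flag rules might look like the sketch below; the `cache` and `db` handles, key scheme, and `fetch_rules` method are assumptions standing in for a Redis client and SQL data-access layer:

```python
import json

def get_flag_rules(flag_name, cache, db, ttl_s=60):
    """Cache-aside: try the cache first, fall back to the system of record."""
    key = f"flag:{flag_name}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                 # cache hit
    rules = db.fetch_rules(flag_name)             # cache miss: read the DB
    cache.set(key, json.dumps(rules), ex=ttl_s)   # TTL bounds staleness
    return rules

def on_flag_change(flag_name, cache):
    """Invalidation hook: drop the key so the next read repopulates it."""
    cache.delete(f"flag:{flag_name}")
```

The explicit `on_flag_change` hook plus a short TTL gives two independent freshness mechanisms, which is what keeps propagation under the 60-second budget even if an invalidation is missed.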

4) Reliability and Failure Strategy

  • Use strong write constraints (transactions or conditional writes) and an explicit backup/restore strategy.
  • Apply strict input validation and backward-compatible versioning.
  • Bound staleness with TTL + invalidation hooks for critical entities.

5) Validation Plan

  • Run one peak-load test, one dependency-degradation test, and one failover test.
  • Verify idempotency for all retried writes and async consumers.
  • Track user-facing SLOs first: p95 latency, error rate, and successful throughput.
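One way to make retried flag writes verifiably idempotent is to key each mutation by a client-supplied request id and replay the stored result on a duplicate. The sketch below is a minimal in-memory illustration; the `FlagService` class and field names are assumptions:

```python
# Idempotent flag updates: applying the same request id twice yields
# exactly one state change, so client retries are safe.
class FlagService:
    def __init__(self):
        self.flags = {}
        self.seen_requests = {}  # request_id -> previously returned result

    def set_flag(self, request_id, flag_name, enabled):
        if request_id in self.seen_requests:
            return self.seen_requests[request_id]  # retry: replay old result
        self.flags[flag_name] = enabled
        result = {"flag": flag_name, "enabled": enabled}
        self.seen_requests[request_id] = result
        return result

svc = FlagService()
svc.set_flag("req-1", "dark-mode", True)
svc.set_flag("req-1", "dark-mode", True)  # retried write is a no-op
```

In production the `seen_requests` map would live in the database (e.g., a unique constraint on the request id) rather than in process memory.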

6) Trade-offs to Call Out in Interviews

  • Databases: SQL gives stronger transactional guarantees; NoSQL often gives better write scaling and flexibility.
  • API Design: Rich APIs improve developer speed but can create long-term compatibility burden.
  • Caching: Higher hit rate cuts latency/cost, but stale data and invalidation bugs become primary risks.

Practical Notes

  • SDKs should cache all flag rules locally - evaluation happens in-memory with no network call. The SDK polls the server periodically for updates.
  • Server-side: store flag rules in a database, cache in Redis. On flag change, update the cache; SDKs pick up changes on next poll.
  • Percentage rollouts: hash(user_id + flag_name) % 100 → deterministic, consistent bucketing. The same user always gets the same variant.
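The bucketing described above can be sketched as follows. Note the use of `hashlib` rather than Python's built-in `hash()`, which is salted per process and would break the cross-process, cross-language consistency the scheme depends on:

```python
import hashlib

def bucket(user_id: str, flag_name: str) -> int:
    """Deterministic bucket in [0, 100), reproducible in any SDK language."""
    digest = hashlib.sha1(f"{user_id}:{flag_name}".encode()).hexdigest()
    return int(digest, 16) % 100

def in_rollout(user_id: str, flag_name: str, percent: int) -> bool:
    return bucket(user_id, flag_name) < percent

# The same user/flag pair always lands in the same bucket, so a user
# enabled at 5% stays enabled as the rollout widens to 25%, 50%, 100%.
assert bucket("user-42", "new-checkout") == bucket("user-42", "new-checkout")
```

Including the flag name in the hash input also decorrelates rollouts: a user in the first 5% for one flag is not automatically in the first 5% for every flag.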


Reference Solution

Why This Solution Works

Request path: The solution keeps ingress, service logic, and stateful dependencies separated so each layer can scale independently.

Reference flow: Web Clients -> API Gateway -> API Service -> Redis Cache -> Primary SQL DB (on cache miss)

Design strengths

  • Cache sits on the read path to absorb repeated queries and keep DB pressure stable.

Interview defense

  • This design makes bottlenecks explicit (ingress, core compute, persistence, async workers).
  • It supports progressive scaling without re-architecting the core request path.
  • It keeps correctness-sensitive state changes in durable systems while offloading background work asynchronously.