Databases in System Design
Databases are the durability and correctness backbone of distributed systems. Most architectural decisions eventually reduce to data modeling, consistency guarantees, and read-write scaling behavior.
What It Is
A database is the authoritative system that stores and serves application state with defined guarantees for durability, consistency, isolation, and query behavior. Relational, document, key-value, and specialized engines each optimize for different workloads. The right choice is less about hype and more about matching access patterns, correctness needs, and operational constraints.
When to Use It
Use relational databases when transactions, referential integrity, and complex joins are core requirements. Billing, inventory, and identity systems benefit from strong ACID guarantees.
Use document or key-value stores when access patterns are dominated by single-key lookups with flexible schemas. Session stores, user preferences, and content catalogs are common fits.
Use specialized engines (time-series, graph, search) only when a measurable query pattern cannot be served efficiently by a general-purpose store. Each additional engine adds operational cost.
Why Databases Matter
Product features become reliable only when their data model is reliable. Order systems, messaging timelines, billing workflows, and identity services all depend on predictable write semantics and query patterns. A mismatch between data model and workload creates chronic latency and incident risk.
Scaling pressure usually appears first in storage. Traffic can be distributed across stateless services, but write coordination, index maintenance, and transaction boundaries converge in the database layer. Early schema and indexing decisions strongly influence how gracefully the system grows.
Database architecture also determines recovery posture. Backup cadence, replication topology, and failover procedures define real-world RTO and RPO outcomes. Teams that treat these as secondary concerns often discover during incidents that theoretical durability was never operationalized.
Data quality also drives downstream analytics and machine learning outcomes. If operational schemas lack clear ownership, retention rules, and consistency guarantees, downstream reporting drifts from reality. Good database discipline in transactional paths improves decision quality across the entire organization.
Core Concepts and Mental Models
Model data around access paths, not abstract entities alone. If critical queries need low latency, ensure indexes and partition keys align with those exact filters and sort orders. Query-driven modeling reduces expensive rewrites later when traffic and feature complexity increase.
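To make this concrete, here is a minimal sketch of query-driven modeling using SQLite as a stand-in engine (the `orders` schema and index name are illustrative): the composite index is shaped to match the hot query's filter column first and its sort column second, and `EXPLAIN QUERY PLAN` confirms the engine can serve the query from that index.

```python
import sqlite3

# Hypothetical orders table: the hot query filters by user_id and sorts by
# created_at, so the composite index mirrors exactly that shape.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        created_at TEXT NOT NULL,
        total_cents INTEGER NOT NULL
    )
""")
# Filter column first, sort column second: the engine can both locate the
# user's rows and return them in order without a separate sort step.
conn.execute("CREATE INDEX idx_orders_user_created ON orders (user_id, created_at)")

plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT id, total_cents FROM orders
    WHERE user_id = ? ORDER BY created_at DESC LIMIT 20
""", (42,)).fetchall()
print(plan)  # the plan should report a search using idx_orders_user_created
```

The same habit applies to any engine: write the top queries first, then shape indexes and partition keys around them, not the other way around.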
Separate transactional boundaries from analytical workloads. OLTP systems need tight latency and correctness, while analytics can tolerate batch latency and denormalization. Isolating these patterns protects user-facing paths and gives data teams freedom to optimize heavy reporting separately.
Replication and partitioning solve different problems. Replication improves read scale and availability, while partitioning addresses write throughput and dataset size. Systems that mix these concerns without clear intent often end up with expensive topologies that solve neither bottleneck well.
Key Tradeoffs
| Decision | Upside | Downside | Guidance |
|---|---|---|---|
| SQL vs NoSQL | SQL provides transactions, joins, and referential integrity | SQL schemas are rigid and horizontal write scaling is harder | Choose based on query patterns and consistency needs, not popularity |
| Normalized vs denormalized schema | Normalization reduces write anomalies and duplication | Joins add latency on read-heavy paths | Normalize writes, denormalize read models when latency requires it |
| Single-region vs multi-region | Single-region is simpler and cheaper | A regional outage takes the system down, and remote users see higher read latency | Start single-region; add replication when RTO/RPO or latency targets demand it |
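The "normalize writes, denormalize read models" guidance can be sketched concretely. In this illustrative SQLite example (table names and schema are assumptions), the normalized tables stay authoritative, and a denormalized summary row is refreshed inside the same transaction so the hot read path is a single-row lookup with no join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL,
                         total_cents INTEGER NOT NULL);
    CREATE TABLE order_summaries (user_id INTEGER PRIMARY KEY,
                                  name TEXT NOT NULL,
                                  order_count INTEGER NOT NULL,
                                  lifetime_cents INTEGER NOT NULL);
""")

def place_order(user_id: int, total_cents: int) -> None:
    # One transaction: the normalized write and the read-model refresh
    # commit together, so the summary can never drift from the source rows.
    with conn:
        conn.execute("INSERT INTO orders (user_id, total_cents) VALUES (?, ?)",
                     (user_id, total_cents))
        conn.execute("""
            INSERT INTO order_summaries
            SELECT u.id, u.name, COUNT(o.id), COALESCE(SUM(o.total_cents), 0)
            FROM users u JOIN orders o ON o.user_id = u.id
            WHERE u.id = ? GROUP BY u.id
            ON CONFLICT(user_id) DO UPDATE SET
                order_count = excluded.order_count,
                lifetime_cents = excluded.lifetime_cents
        """, (user_id,))

conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")
place_order(1, 500)
place_order(1, 700)
# Hot read path: single-row primary-key lookup, no join.
summary = conn.execute("SELECT order_count, lifetime_cents FROM order_summaries "
                       "WHERE user_id = 1").fetchone()
print(summary)  # (2, 1200)
```

At larger scale the refresh often moves to an async pipeline instead of the write transaction, trading the in-transaction consistency shown here for lower write latency.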
Common Mistakes
- Over-normalization on hot read paths: excessive joins increase latency. Denormalize read models where access frequency justifies the trade.
- Treating eventual consistency as free: it reduces coordination cost but introduces stale reads and conflict resolution complexity. Reserve it for paths where business rules tolerate temporal lag.
- Database sprawl without justification: each new engine adds backup complexity, operational cost, and incident coordination burden. Add engines only for measurable bottlenecks.
Implementation Playbook
Begin with a schema that captures invariants explicitly. Use constraints, unique indexes, and transaction boundaries to encode business rules near the data layer, not only in application code. This prevents drift across services and reduces silent corruption during edge-case failures.
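A minimal sketch of invariants living next to the data, again using SQLite as a stand-in (the `accounts` schema is illustrative): a `UNIQUE` constraint rejects duplicate emails and a `CHECK` rejects negative balances, no matter which service or script issues the write.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Business rules encoded at the data layer: uniqueness and non-negativity
# hold even for writers that bypass the application code.
conn.execute("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE,
        balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0)
    )
""")
conn.execute("INSERT INTO accounts (email, balance_cents) "
             "VALUES ('a@example.com', 100)")

try:
    conn.execute("INSERT INTO accounts (email, balance_cents) "
                 "VALUES ('a@example.com', 50)")
except sqlite3.IntegrityError as exc:
    print("rejected duplicate:", exc)

try:
    conn.execute("UPDATE accounts SET balance_cents = -1 "
                 "WHERE email = 'a@example.com'")
except sqlite3.IntegrityError as exc:
    print("rejected negative balance:", exc)
```

Both bad writes fail at the database, leaving the original row untouched; application-level validation remains useful for friendly errors, but it is no longer the last line of defense.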
Instrument query performance continuously. Track slow queries, lock wait time, replication lag, and index hit ratio. Visibility into these signals lets teams catch regressions from feature launches or migration scripts before they become user-facing incidents.
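The simplest form of this instrumentation is a timing wrapper at the query call site; this is an illustrative sketch (the threshold and helper name are assumptions), while production systems usually rely on the engine's own slow-query log plus client-side tracing.

```python
import logging
import sqlite3
import time

logging.basicConfig(level=logging.WARNING)
SLOW_QUERY_MS = 50  # illustrative threshold; tune per service SLO

def timed_query(conn, sql, params=()):
    """Run a query and emit a warning if it exceeds the slow-query budget."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms > SLOW_QUERY_MS:
        logging.warning("slow query (%.1f ms): %s", elapsed_ms, sql.strip())
    return rows

# Example usage against a throwaway table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.execute("INSERT INTO t VALUES (1)")
rows = timed_query(conn, "SELECT x FROM t")
print(rows)  # [(1,)]
```

Feeding these timings into a histogram per query class is what makes the p95/p99 regressions from a new feature visible before users report them.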
Treat migrations as deployable software. Prefer additive changes, dual-write or backfill phases when needed, and rollback-safe plans. Data migration mistakes are harder to reverse than stateless code rollbacks, so cautious sequencing is a core engineering discipline.
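The additive-then-backfill sequencing looks like this in miniature; the schema and batch size are illustrative, and real backfills add progress tracking and throttling. Phase 1 adds a nullable column so existing code keeps working, phase 2 backfills in small batches so locks stay short, and only later does a phase 3 tighten constraints or remove the old path.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("Ada",), ("Grace",), ("Edsger",)])

# Phase 1 (additive): a new nullable column; readers and writers are unaffected.
conn.execute("ALTER TABLE users ADD COLUMN name_lower TEXT")

# Phase 2 (backfill): small batches, each in its own short transaction.
BATCH = 2
while True:
    with conn:
        rows = conn.execute(
            "SELECT id, name FROM users WHERE name_lower IS NULL LIMIT ?",
            (BATCH,)).fetchall()
        if not rows:
            break
        conn.executemany("UPDATE users SET name_lower = ? WHERE id = ?",
                         [(name.lower(), uid) for uid, name in rows])

# Phase 3 would happen later: enforce NOT NULL once all writers populate it.
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE name_lower IS NULL").fetchone()[0]
print("rows left to backfill:", remaining)  # 0
```

Each phase is independently deployable and rollback-safe, which is the property stateless code rollbacks get for free and data migrations must earn.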
Formalize backup and restore drills. A backup policy is only credible when restore time and data integrity are tested regularly. Teams that rehearse restores by service domain recover faster during real incidents and avoid discover-on-fire surprises in tooling, credentials, or data lineage.
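A restore drill in miniature, using SQLite's online backup API as a stand-in for whatever your engine provides: take a backup of a live database, then verify the restored copy rather than trusting that the backup file exists.

```python
import sqlite3

# Source database with some live data.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
src.executemany("INSERT INTO events (payload) VALUES (?)", [("a",), ("b",)])
src.commit()

# Online backup into a second database; production engines have equivalents
# (pg_basebackup, mysqldump, snapshot tooling, and so on).
backup = sqlite3.connect(":memory:")
src.backup(backup)

# The drill's point: query the restored copy and check integrity, don't
# just confirm the backup job exited zero.
restored = backup.execute("SELECT COUNT(*) FROM events").fetchone()[0]
assert restored == 2, "restore drill failed: row count mismatch"
print("restore verified:", restored, "rows")
```

Scheduling this kind of verification per service domain is what turns a stated RPO into a number the team has actually measured.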
Practice Path for Databases
Course Chapters
- Database Fundamentals
Storage models, transactions, and index basics for production systems.
- Database Scaling
Replication, partitioning, and write path constraints at scale.
- Data Models and Query Languages
Choosing relational, document, or hybrid schema strategies.
- Storage Engines and Indexing
B-trees, LSM trees, and the read-write amplification tradeoff.
Guided Labs
- Database Replication & Read Scaling
Add read replicas to scale database reads and add failover for high availability.
- NoSQL & Document Databases
When to choose NoSQL over SQL and how to model data for document stores like MongoDB.
- Schema Design Workshop
Balance normalized OLTP schemas, denormalized read models, and zero-downtime migration controls.
- Transactions & Isolation in Practice
Design a checkout flow that keeps correctness under retries, async workers, and partial failures.
Challenge Progression
- 1. Cake Shop 1 - Going Online (Cake Shop · easy)
- 2. IoT Platform 1 - Smart Home Hub (IoT Platform · easy)
- 3. Payment Gateway 1 - Online Checkout (Payment Gateway · easy)
- 4. Social Feed 1 - MVP Launch (Social Feed · easy)
- 5. Bookmark Manager (Starter · easy)
- 6. Contact Form Service (Starter · easy)
Public Solution Walkthroughs
- Cake Shop 1 - Going Online: full solution walkthrough with architecture breakdown
- IoT Platform 1 - Smart Home Hub: full solution walkthrough with architecture breakdown
- Payment Gateway 1 - Online Checkout: full solution walkthrough with architecture breakdown
- Social Feed 1 - MVP Launch: full solution walkthrough with architecture breakdown
Related Articles
Back-of-the-Envelope Estimation for System Design Interviews
A step-by-step framework for capacity estimation: QPS, storage, bandwidth, and memory calculations that interviewers actually expect.
10 min read
Database Scaling Strategies: Replication, Sharding, and Partitioning
A practical guide to scaling databases in system design: when to replicate, when to shard, and how partitioning strategies affect your architecture.
9 min read
Frequently Asked Questions
How do I choose between SQL and NoSQL for a new feature?
Start with required invariants and query patterns. SQL is usually better for transactions and relational constraints. NoSQL fits high-throughput, flexible-schema access paths. Many systems use both, but each store should have a clear bounded purpose.
When should I shard a database?
Shard when single-node vertical scaling, indexing, and replication strategies no longer satisfy throughput or storage requirements. Sharding adds major operational and query complexity, so delay it until objective limits are clear.
What database metrics are most important in production?
Watch p95 and p99 query latency, lock contention, replication lag, connection saturation, and error rates by query class. Pair these with storage growth and checkpoint behavior to anticipate capacity issues.
Can read replicas replace a caching strategy?
They solve different layers of the problem. Replicas distribute read load and improve availability, while caches reduce repeated query cost and latency. High-scale systems usually use both with explicit consistency boundaries.