Databases in System Design

Databases are the durability and correctness backbone of distributed systems. Most architectural decisions eventually reduce to data modeling, consistency guarantees, and read-write scaling behavior.

Start Practicing: Database Replication & Read Scaling

What It Is

A database is the authoritative system that stores and serves application state with defined guarantees for durability, consistency, isolation, and query behavior. Relational, document, key-value, and specialized engines each optimize for different workloads. The right choice is less about hype and more about matching access patterns, correctness needs, and operational constraints.

When to Use It

Use relational databases when transactions, referential integrity, and complex joins are core requirements. Billing, inventory, and identity systems benefit from strong ACID guarantees.

Use document or key-value stores when access patterns are dominated by single-key lookups with flexible schemas. Session stores, user preferences, and content catalogs are common fits.

Use specialized engines (time-series, graph, search) only when a measurable query pattern cannot be served efficiently by a general-purpose store. Each additional engine adds operational cost.

Why Databases Matter

Product features become reliable only when their data model is reliable. Order systems, messaging timelines, billing workflows, and identity services all depend on predictable write semantics and query patterns. A mismatch between data model and workload creates chronic latency and incident risk.

Scaling pressure usually appears first in storage. Traffic can be distributed across stateless services, but write coordination, index maintenance, and transaction boundaries converge in the database layer. Early schema and indexing decisions strongly influence how gracefully the system grows.

Database architecture also determines recovery posture. Backup cadence, replication topology, and failover procedures define real-world RTO and RPO outcomes. Teams that treat these as secondary concerns often discover during incidents that theoretical durability was never operationalized.

Data quality also drives downstream analytics and machine learning outcomes. If operational schemas lack clear ownership, retention rules, and consistency guarantees, downstream reporting drifts from reality. Good database discipline in transactional paths improves decision quality across the entire organization.

Core Concepts and Mental Models

Model data around access paths, not abstract entities alone. If critical queries need low latency, ensure indexes and partition keys align with those exact filters and sort orders. Query-driven modeling reduces expensive rewrites later when traffic and feature complexity increase.
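As a small illustration of aligning an index with an exact filter and sort order, consider a hypothetical messaging timeline (table and index names are invented here). The hot query filters by conversation and sorts newest-first, so the composite index covers exactly that shape, and SQLite's query plan confirms it is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages ("
    " id INTEGER PRIMARY KEY,"
    " conversation_id INTEGER NOT NULL,"
    " created_at INTEGER NOT NULL,"
    " body TEXT)"
)
# Index matches the query's equality filter and its sort order.
conn.execute("CREATE INDEX idx_conv_time ON messages (conversation_id, created_at DESC)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT body FROM messages "
    "WHERE conversation_id = ? ORDER BY created_at DESC LIMIT 50",
    (42,),
).fetchall()
# The plan detail should report a search using idx_conv_time,
# with no separate sort step for the ORDER BY.
```

If the index listed the columns in a different order, or omitted the sort column, the engine would fall back to scanning and sorting, which is exactly the kind of mismatch query-driven modeling avoids.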

Separate transactional boundaries from analytical workloads. OLTP systems need tight latency and correctness, while analytics can tolerate batch latency and denormalization. Isolating these patterns protects user-facing paths and gives data teams freedom to optimize heavy reporting separately.

Replication and partitioning solve different problems. Replication improves read scale and availability, while partitioning addresses write throughput and dataset size. Systems that mix these concerns without clear intent often end up with expensive topologies that solve neither bottleneck well.
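The partitioning half of that distinction can be sketched in a few lines: a stable hash routes each key to one of N partitions so write load and data volume are spread out. Replication, by contrast, would copy each partition to multiple nodes for read scale and failover. The key format and partition count below are illustrative.

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative; real systems choose this around growth plans

def partition_for(key: str) -> int:
    """Deterministically map a key to a partition via a stable hash."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Every caller routes the same key to the same partition, so single-key
# reads and writes never need cross-partition coordination.
```

Note that resizing NUM_PARTITIONS with plain modulo remaps most keys; production systems typically use consistent hashing or pre-split partition maps to avoid mass data movement.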

Key Tradeoffs

  • SQL vs NoSQL — SQL provides transactions and referential integrity; NoSQL offers flexible schema and horizontal write scaling. Guidance: choose based on query patterns and consistency needs, not popularity.
  • Normalized vs denormalized schema — normalization reduces write anomalies; denormalization reduces join cost on read-heavy paths. Guidance: normalize writes, denormalize read models when latency requires it.
  • Single-region vs multi-region — single-region is simpler and cheaper; multi-region improves availability and local read latency. Guidance: start single-region; add replication when RTO/RPO or latency targets demand it.

Common Mistakes

  • Over-normalization on hot read paths: excessive joins increase latency. Denormalize read models where access frequency justifies the trade.
  • Treating eventual consistency as free: it reduces coordination cost but introduces stale reads and conflict resolution complexity. Reserve it for paths where business rules tolerate temporal lag.
  • Database sprawl without justification: each new engine adds backup complexity, operational cost, and incident coordination burden. Add engines only for measurable bottlenecks.

Implementation Playbook

Begin with a schema that captures invariants explicitly. Use constraints, unique indexes, and transaction boundaries to encode business rules near the data layer, not only in application code. This prevents drift across services and reduces silent corruption during edge-case failures.
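One way to encode an invariant near the data layer is a partial unique index. The sketch below enforces a hypothetical rule, "at most one active subscription per user," in the database itself rather than in application code; the table and status values are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (
    id INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL,
    status TEXT NOT NULL CHECK (status IN ('active', 'canceled'))
);
-- Uniqueness applies only to rows WHERE status = 'active'.
CREATE UNIQUE INDEX one_active_sub
    ON subscriptions (user_id) WHERE status = 'active';
""")

conn.execute("INSERT INTO subscriptions (user_id, status) VALUES (1, 'active')")
conn.execute("INSERT INTO subscriptions (user_id, status) VALUES (1, 'canceled')")  # allowed

try:
    # A second *active* row for user 1 violates the partial unique index.
    conn.execute("INSERT INTO subscriptions (user_id, status) VALUES (1, 'active')")
    violated = False
except sqlite3.IntegrityError:
    violated = True  # the database, not the app, rejected the duplicate
```

Because every service touching this table hits the same constraint, the rule cannot drift the way duplicated application-level checks can.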

Instrument query performance continuously. Track slow queries, lock wait time, replication lag, and index hit ratio. Visibility into these signals lets teams catch regressions from feature launches or migration scripts before they become user-facing incidents.
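A minimal sketch of slow-query visibility is a timing wrapper around query execution. Real deployments would use the engine's slow-query log or an APM agent instead; the threshold and the in-memory log here are purely illustrative.

```python
import sqlite3
import time

SLOW_MS = 100.0  # illustrative threshold; tune per workload
slow_queries: list[tuple[str, float]] = []

def timed_execute(conn: sqlite3.Connection, sql: str, params=()):
    """Execute a query and record it if it exceeds the slow threshold."""
    start = time.perf_counter()
    cursor = conn.execute(sql, params)
    elapsed_ms = (time.perf_counter() - start) * 1000
    if elapsed_ms >= SLOW_MS:
        slow_queries.append((sql, elapsed_ms))  # would feed metrics/alerting
    return cursor

conn = sqlite3.connect(":memory:")
result = timed_execute(conn, "SELECT 1").fetchone()
```

Feeding these records into dashboards alongside lock waits and replication lag gives the regression visibility the paragraph above describes.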

Treat migrations as deployable software. Prefer additive changes, dual-write or backfill phases when needed, and rollback-safe plans. Data migration mistakes are harder to reverse than stateless code rollbacks, so cautious sequencing is a core engineering discipline.
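The additive-then-backfill sequence can be sketched end to end. The schema, column names, and tiny batch size below are invented for illustration; the shape of the phases is what matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)", [("a@x.com",), ("b@x.com",)])

# Phase 1: additive change. Old code keeps working because the new
# column is nullable and nothing reads it yet.
conn.execute("ALTER TABLE users ADD COLUMN email_normalized TEXT")

# Phase 2: batched backfill (batch size of 1 just to show the loop shape;
# real batches are sized to bound lock time and replication lag).
while True:
    rows = conn.execute(
        "SELECT id, email FROM users WHERE email_normalized IS NULL LIMIT 1"
    ).fetchall()
    if not rows:
        break
    for row_id, email in rows:
        conn.execute(
            "UPDATE users SET email_normalized = ? WHERE id = ?",
            (email.strip().lower(), row_id),
        )
    conn.commit()  # commit per batch so progress survives interruption

# Phase 3 (not shown): new reads switch to email_normalized; the old
# column is dropped in a later, independent migration once nothing uses it.
backfilled = conn.execute("SELECT email_normalized FROM users ORDER BY id").fetchall()
```

Each phase is independently deployable and rollback-safe: aborting after phase 1 or mid-backfill leaves the system fully functional on the old column.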

Formalize backup and restore drills. A backup policy is only credible when restore time and data integrity are tested regularly. Teams that rehearse restores by service domain recover faster during real incidents and avoid discover-on-fire surprises in tooling, credentials, or data lineage.

Practice Path for Databases

Course Chapters

Guided Labs

Challenge Progression

  1. Cake Shop 1 - Going Online (Cake Shop · easy)
  2. IoT Platform 1 - Smart Home Hub (IoT Platform · easy)
  3. Payment Gateway 1 - Online Checkout (Payment Gateway · easy)
  4. Social Feed 1 - MVP Launch (Social Feed · easy)
  5. Bookmark Manager (Starter · easy)
  6. Contact Form Service (Starter · easy)

Public Solution Walkthroughs

Related Articles

Frequently Asked Questions

How do I choose between SQL and NoSQL for a new feature?

Start with required invariants and query patterns. SQL is usually better for transactions and relational constraints. NoSQL fits high-throughput, flexible-schema access paths. Many systems use both, but each store should have a clear bounded purpose.

When should I shard a database?

Shard when single-node vertical scaling, indexing, and replication strategies no longer satisfy throughput or storage requirements. Sharding adds major operational and query complexity, so delay it until objective limits are clear.

What database metrics are most important in production?

Watch p95 and p99 query latency, lock contention, replication lag, connection saturation, and error rates by query class. Pair these with storage growth and checkpoint behavior to anticipate capacity issues.

Can read replicas replace a caching strategy?

They solve different layers of the problem. Replicas distribute read load and improve availability, while caches reduce repeated query cost and latency. High-scale systems usually use both with explicit consistency boundaries.