Database Replication Patterns for System Design

March 31, 2026 · Updated March 31, 2026 · 9 min read

Leader-follower, multi-leader, and leaderless replication compared. How to handle replication lag, failover, and split-brain without losing data.

Definition

Database replication copies data across multiple database servers so that reads can scale horizontally, data survives node failures, and clients in different regions can read from nearby replicas.

Implementation Checklist

  • Start with leader-follower replication. It is the simplest model and covers most read-scaling needs.
  • Monitor replication lag continuously. Stale reads from lagging replicas cause subtle bugs that are hard to reproduce.
  • Design your application to tolerate eventual consistency. If a write is followed by an immediate read, route that read to the leader, not a replica.
  • Test failover regularly. An untested failover procedure is no procedure at all. Automate promotion and DNS updates.
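The lag-monitoring item in the checklist can be sketched as a small check that runs on a schedule. This is a minimal sketch, not a production monitor: `ReplicaStatus`, `lagging_replicas`, and the idea of comparing commit timestamps are assumptions made for illustration, and a real deployment would read these values from the database's own replication statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplicaStatus:
    """Hypothetical snapshot of one replica's progress."""
    name: str
    # Commit time (in seconds) of the newest leader write this replica has applied.
    last_applied_at: float


def lagging_replicas(leader_last_commit_at, replicas, max_lag_seconds):
    """Return the names of replicas that trail the leader by more than max_lag_seconds."""
    lagging = []
    for replica in replicas:
        # Clamp at zero so small clock differences never produce negative lag.
        lag = max(0.0, leader_last_commit_at - replica.last_applied_at)
        if lag > max_lag_seconds:
            lagging.append(replica.name)
    return lagging
```

A caller could alert on any non-empty result, or temporarily remove the named replicas from the read pool until they catch up.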

Replication Lag Is a Feature, Not a Bug

Replication lag is inherent to asynchronous replication: a replica always trails the leader by at least the time it takes to ship and apply a change. Rather than fighting it, design your application around it. Explicitly separate read paths (replica-safe) from write paths (leader-required).

For operations where the user must see their own write immediately (post a comment then see it), use read-your-writes consistency by routing that specific read to the leader. For everything else, replicas are fine.
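One way to express that routing rule is to pin a user's reads to the leader for a short window after each of their writes. The sketch below is illustrative only: `ReadRouter`, `pin_seconds`, and the string targets `"leader"` and `"replica"` are hypothetical names, and the approach assumes the pin window is longer than the replication lag you normally observe.

```python
import time


class ReadRouter:
    """Sends a user's reads to the leader for a short window after they write."""

    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds   # assumed to exceed typical replication lag
        self.clock = clock               # injectable so the logic can be tested
        self._last_write_at = {}         # user_id -> time of that user's latest write

    def record_write(self, user_id):
        """Call this whenever the user performs a write on the leader."""
        self._last_write_at[user_id] = self.clock()

    def target_for_read(self, user_id):
        """Return "leader" if the user wrote recently, otherwise "replica"."""
        last_write = self._last_write_at.get(user_id)
        if last_write is not None and self.clock() - last_write < self.pin_seconds:
            return "leader"
        return "replica"
```

Only the user who just wrote is pinned, so other users keep reading from replicas and the leader absorbs just the reads that genuinely need fresh data. In a multi-process application the last-write timestamps would need to live somewhere shared, such as the user's session.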

Failover Is the Real Test

A replication setup is only as good as its failover procedure. Automated failover that has never been tested is a liability. Run chaos engineering exercises that kill the leader and measure time-to-recovery.

Document the runbook: how long promotion takes, how much data loss is acceptable, who gets paged, and which clients need to reconnect. Practice until the team can execute it under stress.

Tradeoff Table

| Decision | Speed-First Option | Reliability-First Option | Recommended When |
|---|---|---|---|
| Synchronous vs Asynchronous Replication | Async replication has lower write latency since the leader does not wait for replicas | Sync replication guarantees zero data loss on leader failure but adds write latency | Use async for most workloads. Use semi-sync (one sync replica + async rest) when zero data loss is critical |
| Leader-Follower vs Multi-Leader | Leader-follower is simpler: one write path, no conflict resolution | Multi-leader allows writes in multiple regions, reducing write latency for geo-distributed users | Use multi-leader only when you genuinely need writes from multiple regions and can handle conflict resolution complexity |
| Read Replicas vs Caching | Cache delivers sub-millisecond reads and offloads the database completely | Read replicas serve consistent (if slightly stale) data without cache invalidation complexity | Use both: cache for hot keys with high read amplification, replicas for long-tail queries and reporting |
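The difference between async, semi-sync, and fully sync replication comes down to how many replica acknowledgements the leader waits for before confirming a write. The function below is a rough model of that rule, not any database's actual implementation: `write_is_durable` and its parameters are hypothetical names chosen for this sketch.

```python
def write_is_durable(acked_by, sync_replicas, required_sync_acks=1):
    """Decide whether the leader may confirm a write to the client.

    acked_by:           names of replicas that have confirmed receiving the write
    sync_replicas:      names of replicas designated as synchronous
    required_sync_acks: 0 models async, 1 models semi-sync,
                        len(sync_replicas) models fully synchronous replication
    """
    # Acknowledgements from asynchronous replicas do not count toward durability.
    confirmed = set(acked_by) & set(sync_replicas)
    return len(confirmed) >= required_sync_acks
```

With `required_sync_acks=1`, a write is confirmed as soon as one synchronous replica holds it, so a leader crash cannot lose a confirmed write as long as that replica survives.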

Practice Next

Replication Topic Hub

Definitions, patterns, and production considerations for database replication.

Database Replication Lab

Practice setting up read replicas, failover, and replication topology in the interactive lab.

Challenges

  • Cake Shop 3 - Going International

    Design multi-region replication and consistency for a globally distributed bakery.

  • Cake Shop 4 - Marketplace Scale

    Handle marketplace-scale traffic with sharded, replicated databases across regions.

Frequently Asked Questions

What happens during a leader failover?

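The monitoring system detects that the leader is unresponsive, selects the replica with the most up-to-date data, promotes it to leader, and reconfigures the other replicas to follow the new leader. Writes are briefly unavailable during the switch, and with asynchronous replication any writes the old leader had not yet shipped to the promoted replica are lost.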
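The selection step can be sketched as picking the healthy replica with the highest applied log position. This is a simplified illustration under stated assumptions: `Replica`, `choose_new_leader`, and `writes_at_risk` are hypothetical names, and replication progress is modeled as a single integer position.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Replica:
    """Hypothetical view of a replica as seen by the failover controller."""
    name: str
    applied_position: int   # highest replication log position applied so far
    healthy: bool = True


def choose_new_leader(replicas) -> Optional[Replica]:
    """Pick the healthy replica that has applied the most of the leader's log."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        return None  # nothing safe to promote; a human needs to step in
    return max(candidates, key=lambda r: r.applied_position)


def writes_at_risk(old_leader_position, new_leader):
    """Count log positions the old leader wrote that the new leader never received."""
    return max(0, old_leader_position - new_leader.applied_position)
```

Recording `writes_at_risk` during every drill gives the runbook a measured figure for potential data loss instead of a guess.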

How do I handle split-brain in replication?

Split-brain occurs when two nodes both believe they are the leader. Prevent it with fencing mechanisms (STONITH), quorum-based leader election, or lease-based leadership that expires automatically.
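Lease-based leadership can be illustrated with a small in-memory model. This is only a sketch of the idea: `LeaderLease` is a hypothetical class, and it assumes the lease lives in a single place with a trustworthy clock. A real system would keep the lease in a consensus-backed coordination service and leave a safety margin for clock drift.

```python
import time


class LeaderLease:
    """In-memory model of a leadership lease that expires unless renewed."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock        # injectable so expiry can be tested
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node_id):
        """Acquire or renew the lease; fails while another node's lease is still valid."""
        now = self.clock()
        if self.holder is None or self.holder == node_id or now >= self.expires_at:
            self.holder = node_id
            self.expires_at = now + self.ttl_seconds
            return True
        return False

    def is_leader(self, node_id):
        """A node may accept writes only while it holds an unexpired lease."""
        return self.holder == node_id and self.clock() < self.expires_at
```

A leader that is cut off from the lease store cannot renew, so it stops accepting writes once its lease expires, and only then can another node take over. At most one node passes `is_leader` at any moment, which is the property that rules out split-brain in this model.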

Can I use read replicas for analytics queries?

Yes, this is one of the best use cases. Route heavy analytics queries to a dedicated read replica so they do not impact production read/write performance. Accept that analytics data may lag by seconds to minutes.