Replication In Depth

Replication means keeping a copy of the same data on multiple machines connected via a network. It provides redundancy (if one node fails, data is still available), improves read throughput (queries can be served from multiple replicas), and reduces latency (replicas can be placed geographically closer to users). However, all the difficulty in replication lies in handling changes to replicated data. This chapter explores three main approaches to replication and the consistency guarantees they offer.

Why Replicate?

There are several reasons to replicate data across multiple nodes:

  • High availability: The system can continue operating even when one or more machines go down.
  • Reduced latency: Users can read from a replica that is geographically close to them rather than routing every request across the globe.
  • Read scalability: By serving read queries from multiple replicas, the system handles a much higher volume of read traffic than a single node could manage alone.
  • Disaster recovery: If an entire data center fails, a replica in another data center can take over.

The fundamental challenge is that any dataset that changes over time requires a strategy for propagating those changes to every replica. The three most common architectures are single-leader, multi-leader, and leaderless replication.

Single-Leader Replication

In single-leader replication (also called primary-secondary or master-slave replication), one node is designated as the leader (primary). All writes are sent to the leader, which first writes the new data to its local storage. It then sends the data change to all of its followers (secondaries/replicas) as part of a replication log or change stream. Each follower applies the writes in the same order as the leader.

Reads can be served by any node (leader or follower), but writes are only accepted by the leader. This is the most common replication mode, used by PostgreSQL, MySQL, MongoDB, and many other databases.

[Diagram: clients send writes to the Leader (primary); a replication stream fans out to Followers 1, 2, and 3; reads can be served by the leader and by all followers.]

Synchronous vs Asynchronous Replication

A critical design choice is whether the leader waits for followers to confirm they have received and written the data before reporting success to the client:

  • Synchronous replication: The leader waits for at least one follower to confirm the write before acknowledging success. This guarantees that the follower has an up-to-date copy. However, if the synchronous follower is slow or unresponsive, the leader must block all writes, harming availability. In practice, semi-synchronous mode is common: one follower is synchronous, the rest are asynchronous.
  • Asynchronous replication: The leader sends the write to followers but does not wait for confirmation. Writes are fast and the leader never blocks. The downside is a durability risk: if the leader fails before the write has been replicated, that write is lost. Many production systems default to asynchronous replication because the latency and availability cost of fully synchronous replication is too high at scale.
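
The difference can be sketched as a toy write path for the semi-synchronous mode described above. All class and method names here are illustrative, not from any real database; a real leader would send to async followers without waiting at all.

```python
# Sketch of a leader's write path under semi-synchronous replication
# (illustrative names only).

class Follower:
    def __init__(self, name):
        self.name = name
        self.log = []

    def apply(self, entry):
        self.log.append(entry)
        return True  # acknowledgement back to the leader

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.log = []
        self.sync_follower = sync_follower
        self.async_followers = async_followers

    def write(self, entry):
        self.log.append(entry)                # 1. write to local storage
        ok = self.sync_follower.apply(entry)  # 2. wait for the one sync follower
        for f in self.async_followers:        # 3. ship to async followers
            f.apply(entry)                    #    (a real leader would not wait)
        return ok                             # 4. ack once the sync follower confirms

sync_f = Follower("f1")
leader = Leader(sync_f, [Follower("f2"), Follower("f3")])
assert leader.write({"key": "x", "value": 1})
assert sync_f.log == leader.log  # the sync follower is guaranteed up to date
```

If the synchronous follower in step 2 hangs, so does every write, which is exactly the availability trade-off discussed above.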

Setting Up New Followers

When you need to add a new follower (for increased read capacity or to replace a failed node), you cannot simply copy the leader's data files because the data is constantly in flux. The standard process is:

  • Take a consistent snapshot of the leader's database at a known position in the replication log (without locking the entire database).
  • Copy the snapshot to the new follower node.
  • The follower connects to the leader and requests all data changes that occurred since the snapshot position.
  • Once the follower has processed the backlog of changes, it is caught up and can continue processing new changes as they arrive.
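
The steps above can be sketched as follows, assuming a hypothetical leader that exposes its replication log as an ordered list and can report a snapshot's log position:

```python
# Minimal sketch of bootstrapping a new follower from a snapshot
# (illustrative names; real snapshots are data files, not log copies).

class Leader:
    def __init__(self):
        self.log = []  # ordered list of changes

    def write(self, change):
        self.log.append(change)

    def snapshot(self):
        # consistent snapshot plus the replication-log position it corresponds to
        return list(self.log), len(self.log)

    def changes_since(self, position):
        return self.log[position:]

class NewFollower:
    def __init__(self):
        self.log = []

    def bootstrap(self, leader):
        data, pos = leader.snapshot()   # steps 1-2: snapshot copied over
        self.log = data
        leader.write("later-change")    # writes keep arriving during the copy...
        # steps 3-4: request and apply the backlog since the snapshot position
        self.log.extend(leader.changes_since(pos))

leader = Leader()
for c in ("a", "b", "c"):
    leader.write(c)
f = NewFollower()
f.bootstrap(leader)
assert f.log == leader.log  # caught up despite writes during the copy
```

The snapshot position is the key ingredient: without it, the follower cannot know which changes it already has and which it still needs.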

Handling Node Outages

Follower failure (catch-up recovery): Each follower keeps a log of the data changes it has received. If a follower crashes and restarts, it knows the last transaction it successfully processed and can request all changes since that point from the leader.

Leader failure (failover): Failover is significantly more complex. One of the followers must be promoted to be the new leader, clients must be reconfigured to send writes to the new leader, and other followers must start consuming from the new leader. This can be automatic or manual.

Split-Brain: The Most Dangerous Failover Bug

Split-brain occurs when two nodes both believe they are the leader. This can happen during failover if the old leader comes back online and does not realize it has been replaced. Both nodes accept writes, and since there is no mechanism to resolve conflicts, data is permanently lost or corrupted. Common mitigations include:

  • Fencing (STONITH): "Shoot The Other Node In The Head" - the new leader forcibly powers off the old leader's machine before accepting writes.
  • Lease-based leadership: A leader holds a time-limited lease. If it cannot renew the lease (because the network partitioned it from the consensus quorum), it must step down.
  • Consensus protocols: Using Raft or Paxos to elect a new leader guarantees that only one leader exists at any time, at the cost of additional complexity.
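
The lease-based approach can be illustrated with a toy check (illustrative; real systems renew leases through a consensus service such as etcd or ZooKeeper, and must also account for clock skew):

```python
# Toy lease-based leadership check. A node may only act as leader while
# its lease is unexpired; if it cannot renew (e.g. it is partitioned from
# the quorum), it must step down once the lease runs out.

LEASE_SECONDS = 10.0

class Node:
    def __init__(self):
        self.lease_expires_at = 0.0

    def acquire_lease(self, now):
        self.lease_expires_at = now + LEASE_SECONDS

    def may_accept_writes(self, now):
        return now < self.lease_expires_at

n = Node()
n.acquire_lease(now=100.0)
assert n.may_accept_writes(now=105.0)       # within the lease
assert not n.may_accept_writes(now=111.0)   # lease expired: step down
```

An old leader that comes back after a partition finds its lease expired and refuses writes, which is what prevents split-brain in this scheme.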

Replication Lag

With asynchronous replication, followers may fall behind the leader. The delay between a write being applied on the leader and being reflected on a follower is called replication lag. During normal operation, this lag is a fraction of a second, but under heavy load or network issues, it can grow to seconds or even minutes. This lag introduces several consistency anomalies that we discuss below.

Multi-Leader Replication

In multi-leader (also called master-master or active-active) replication, more than one node accepts writes. Each leader simultaneously acts as a follower to the other leaders. This architecture is useful in specific scenarios:

  • Multi-datacenter operation: You can have a leader in each data center. Writes are processed locally with low latency, and then asynchronously replicated to leaders in other data centers. This tolerates entire data center outages and reduces write latency for geographically distributed users.
  • Offline-capable clients: Applications that need to work offline (like a calendar or notes app on a mobile device) effectively have a local "leader" database. When the device comes back online, it synchronizes with the server, acting as multi-leader replication.
  • Collaborative editing: Google Docs-style editing where multiple users edit the same document concurrently is a form of multi-leader replication at a very fine granularity.

Conflict Resolution

The biggest problem with multi-leader replication is write conflicts. If two leaders concurrently accept writes that modify the same record, a conflict arises when the writes are replicated to each other. Strategies for handling this include:

  • Conflict avoidance: Route all writes for a particular record to the same leader. For example, a user's data always goes to the data center closest to them. This is the simplest approach and is recommended whenever possible.
  • Last-writer-wins (LWW): Attach a timestamp to each write and keep the one with the highest timestamp. This achieves convergence but at the cost of data loss - concurrent writes are silently discarded. Used by Cassandra and DynamoDB by default.
  • Merge values: Keep both conflicting values and let the application merge them. For example, if two users add items to a shopping cart concurrently, take the union of both sets.
  • CRDTs (Conflict-free Replicated Data Types): Data structures that are mathematically guaranteed to converge when replicated. Examples include G-Counters (grow-only counters), PN-Counters, LWW-Registers, OR-Sets, and sequence CRDTs (used in collaborative text editors like Yjs and Automerge). CRDTs are powerful but limited to data types whose merge operation is commutative, associative, and idempotent.
  • Operational transformation (OT): The technique used by Google Docs. Each operation is transformed against concurrent operations to produce a converged result. More flexible than CRDTs but significantly more complex to implement.
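
The simplest CRDT from the list above, a G-Counter, shows why convergence is guaranteed: each replica increments only its own slot, and merging takes the element-wise maximum, an operation that is commutative, associative, and idempotent.

```python
# A G-Counter (grow-only counter) CRDT. Replicas can merge in any order,
# any number of times, and still converge to the same value.

class GCounter:
    def __init__(self, replica_id, n_replicas):
        self.replica_id = replica_id
        self.counts = [0] * n_replicas

    def increment(self):
        self.counts[self.replica_id] += 1   # only touch our own slot

    def merge(self, other):
        # element-wise max: commutative, associative, idempotent
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

    def value(self):
        return sum(self.counts)

a = GCounter(0, 2)
b = GCounter(1, 2)
a.increment(); a.increment()   # two increments on replica A
b.increment()                  # one increment on replica B
a.merge(b); b.merge(a)         # exchange state in either order
assert a.value() == b.value() == 3
```

Because merge is idempotent, delivering the same state twice (as unreliable networks do) changes nothing, which is exactly the property LWW lacks.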

Replication Topologies

With more than two leaders, the topology describes the communication paths along which writes are propagated:

  • All-to-all: Every leader sends its writes to every other leader. Most general topology and most fault-tolerant, but writes can arrive out of order due to varying network delays.
  • Circular (ring): Each leader forwards writes to the next leader in a ring. A failure of one node breaks the ring and stops replication.
  • Star (hub-and-spoke): One designated "root" leader forwards writes to all others. Simpler to manage but creates a single point of failure at the root.

In all-to-all topologies, causality problems can occur: an update may arrive before the insert it depends on if the messages take different network paths. Solutions include version vectors or logical timestamps to track causal ordering.

Leaderless Replication

In leaderless replication (Dynamo-style, after Amazon's Dynamo paper), there is no leader at all. Any replica can accept writes directly. The client sends writes to several replicas in parallel, and reads from several replicas in parallel, using version numbers to determine which value is newer.

This approach is used by Amazon DynamoDB, Apache Cassandra, Riak, and Voldemort. It has the advantage of high availability and tolerance of individual node failures without requiring failover.

Quorum Reads and Writes

If there are n replicas, every write must be confirmed by at least w nodes, and every read must query at least r nodes. As long as w + r > n, reads are guaranteed to see the most recent write because there must be overlap between the set of nodes that acknowledged the write and the set of nodes queried during the read.

Quorum Configuration (n = 3 replicas)

Common configurations:

  • w=2, r=2 → w+r=4 > 3 ✓ (tolerates 1 unavailable node)
  • w=3, r=1 → w+r=4 > 3 ✓ (fast reads, slow writes)
  • w=1, r=3 → w+r=4 > 3 ✓ (fast writes, slow reads)
  • w=1, r=1 → w+r=2 < 3 ✗ (no consistency guarantee)

General rule: w + r > n ensures overlap. With n=5, w=3, r=3, the system tolerates 2 unavailable nodes.
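
The quorum condition is simple enough to state as a one-line predicate, which makes the configurations above easy to check:

```python
# The quorum condition from the text: a read is guaranteed to overlap
# with the most recent write iff w + r > n.

def quorum_overlaps(n, w, r):
    return w + r > n

assert quorum_overlaps(n=3, w=2, r=2)       # classic majority quorum
assert quorum_overlaps(n=3, w=3, r=1)       # fast reads, slow writes
assert not quorum_overlaps(n=3, w=1, r=1)   # no overlap: stale reads possible
assert quorum_overlaps(n=5, w=3, r=3)       # tolerates 2 unavailable nodes
```

The overlap follows from the pigeonhole principle: if w + r > n, the w writers and r readers cannot be drawn from disjoint subsets of the n replicas.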

Read Repair and Anti-Entropy

Even with quorums, replicas can become stale. Two mechanisms keep them up to date:

  • Read repair: When a client reads from multiple replicas and detects that one has a stale value, it writes the newer value back to the stale replica. This works well for frequently-read values but rarely-read data may remain stale indefinitely.
  • Anti-entropy process: A background process constantly compares data between replicas and copies missing data. Unlike the replication log in leader-based systems, this process does not maintain a specific order, so there may be a significant delay before data is copied. Implementations often use Merkle trees to efficiently identify differences between replicas.
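
Read repair can be sketched as follows, assuming a hypothetical replica interface where each value carries a version number (real systems such as Riak use vector clocks rather than a single integer):

```python
# Sketch of read repair: read from several replicas, pick the value with
# the highest version, and write it back to any stale replica.

class Replica:
    def __init__(self, store=None):
        self.store = store or {}

    def get(self, key):
        return self.store.get(key, (0, None))   # (version, value)

    def put(self, key, version, value):
        self.store[key] = (version, value)

def read_with_repair(key, replicas):
    responses = [(r, r.get(key)) for r in replicas]
    newest_version, newest_value = max(v for _, v in responses)
    for replica, (version, _) in responses:
        if version < newest_version:
            replica.put(key, newest_version, newest_value)  # repair stale copy
    return newest_value

fresh = Replica({"x": (2, "new")})
stale = Replica({"x": (1, "old")})
assert read_with_repair("x", [fresh, stale]) == "new"
assert stale.get("x") == (2, "new")   # the stale replica was repaired
```

Note the limitation mentioned above: a key that is never read is never repaired, which is why an anti-entropy process is still needed.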

Limitations of Quorums

Even with w + r > n, edge cases can lead to stale reads:

  • If a sloppy quorum is used (see below), the w writes and r reads may not overlap at all.
  • If two writes occur concurrently, the only safe resolution is to merge them (or use LWW, which loses data).
  • If a write succeeds on some replicas but fails on others (e.g., due to disk full), it is not rolled back on the successful replicas.
  • If a node carrying a new value fails and is restored from a stale replica, the quorum condition may be violated.

Sloppy quorums and hinted handoff: In a large cluster, if the designated n home nodes for a key are not all reachable, writes can be temporarily accepted by other nodes outside that set. When a home node comes back online, the stand-in node forwards the writes it accepted on the home node's behalf (this is the hinted handoff). This increases write availability but means a quorum read may not see the latest value, because the w successful writes may have landed on nodes outside the designated n.
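
A minimal sketch of hinted handoff, with illustrative names: the stand-in node records each write together with a "hint" naming the intended home node, and forwards it once that node recovers.

```python
# Toy hinted handoff. A stand-in node accepts writes for an unreachable
# home node and forwards them when the home node comes back.

class Node:
    def __init__(self):
        self.store = {}
        self.hints = []   # (home_node, key, value) held on another node's behalf
        self.up = True

def write(key, value, home, stand_in):
    if home.up:
        home.store[key] = value
    else:
        stand_in.hints.append((home, key, value))  # take the hinted write

def handoff(stand_in):
    # forward hints whose home node is reachable again; keep the rest
    for home, key, value in stand_in.hints:
        if home.up:
            home.store[key] = value
    stand_in.hints = [h for h in stand_in.hints if not h[0].up]

home, stand_in = Node(), Node()
home.up = False
write("x", 1, home, stand_in)   # home unreachable: stand-in stores the hint
home.up = True
handoff(stand_in)               # home is back: the hinted write is forwarded
assert home.store == {"x": 1} and stand_in.hints == []
```

Until the handoff happens, a quorum read that queries only the home nodes can miss the value, which is the staleness caveat noted above.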

Comparison: Leader-Based vs Leaderless

Leader-Based Replication

  • All writes go through a single node, providing a clear total order of writes.
  • Followers are guaranteed to see writes in the same order as the leader.
  • Failover is needed when the leader goes down, which can cause brief unavailability.
  • Simpler to reason about: one source of truth for write ordering.
  • Read replicas can serve stale data during replication lag.
  • Used by PostgreSQL, MySQL, MongoDB, SQL Server.

Leaderless Replication

  • Any node can accept writes; no single point of failure for writes.
  • No failover needed: if a node is down, writes go to the remaining nodes.
  • Concurrent writes to the same key require conflict resolution (LWW or merge).
  • Harder to provide strong consistency; typically eventual consistency.
  • Better availability under network partitions (each partition can still serve writes).
  • Used by Cassandra, DynamoDB, Riak, Voldemort.

Consistency Guarantees Under Replication Lag

When using asynchronous replication (whether leader-based or leaderless), clients may observe strange behaviors because replicas are not all at the same point in time. Several consistency models address specific anomalies:

Read-After-Write Consistency (Read-Your-Writes)

A user who submits a write and then immediately reads should see their own update. Without this guarantee, a user might update their profile, refresh the page, and see the old profile - making them think the update was lost. Techniques to achieve this:

  • When reading something the user may have modified, read from the leader. For example, always read the user's own profile from the leader, but other users' profiles can be read from followers.
  • Track the timestamp of the user's most recent write. For one minute after the last write, route all reads to the leader. After that, reads can go to followers.
  • The client can remember the timestamp of its most recent write and pass it with the read request. The system ensures the replica serving the read has caught up to at least that timestamp.
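
The third technique can be sketched as a replica-selection check, assuming a hypothetical setup where each replica tracks the highest write position it has applied:

```python
# Sketch of read-your-writes routing: the client remembers the log position
# of its last write and only reads from replicas that have caught up to it.

class Replica:
    def __init__(self):
        self.applied_upto = 0   # highest write position applied on this replica

def choose_replica(replicas, min_position):
    # Serve the read from any replica that has applied the client's last write.
    for r in replicas:
        if r.applied_upto >= min_position:
            return r
    return None   # a real system would fall back to the leader here

lagging, caught_up = Replica(), Replica()
lagging.applied_upto, caught_up.applied_upto = 5, 9
client_last_write_position = 7
assert choose_replica([lagging, caught_up], client_last_write_position) is caught_up
```

The client-side position is cheap to track and, unlike the per-user timer, gives an exact guarantee rather than a heuristic one.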

Monotonic Reads

A user should never see time go backward. If they have already seen a value at time t, they should not subsequently see an older value. This can happen when successive reads hit different replicas at different replication lag levels. The fix: ensure each user always reads from the same replica (e.g., by hashing the user ID to choose a replica).
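
The sticky-routing fix is a deterministic hash of the user ID, sketched here with SHA-256 (any stable hash works):

```python
import hashlib

# Monotonic reads via sticky routing: the same user always maps to the
# same replica, so successive reads never jump between replicas that are
# at different replication lag.

def replica_for_user(user_id, n_replicas):
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % n_replicas

assert replica_for_user("alice", 3) == replica_for_user("alice", 3)  # stable
assert 0 <= replica_for_user("bob", 3) < 3
```

The weakness is also visible here: if the chosen replica fails, the user must be rerouted, and the monotonicity guarantee is briefly lost.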

Consistent Prefix Reads

If a sequence of writes happens in a certain order, a reader should see them in the same order. Without this, a reader might see an answer before the question in a conversation, creating nonsensical sequences. This is particularly problematic in partitioned databases where there is no global ordering of writes. Solutions involve ensuring causally related writes are routed to the same partition or using explicit causal dependency tracking.

Eventual Consistency: What It Really Means

Eventual consistency is a very weak guarantee. It says only that if you stop writing and wait long enough, all replicas will eventually converge to the same value. It does not say how long "long enough" is, nor does it give any guarantees about what a reader will see during the convergence period. The stronger guarantees above (read-your-writes, monotonic reads, consistent prefix reads) are additional properties that a system may or may not provide on top of eventual consistency. When evaluating a database, "eventually consistent" alone tells you very little about the experience your users will have.

Replication in Practice

| Database | Replication Model | Default Mode | Consistency Guarantees |
| --- | --- | --- | --- |
| PostgreSQL | Single-leader | Async (semi-sync available) | Read-your-writes (from leader), tunable |
| MySQL | Single-leader | Async (semi-sync plugin available) | Read-your-writes (from leader) |
| MongoDB | Single-leader (replica set) | Async (w:majority available) | Configurable read/write concern |
| Cassandra | Leaderless | Tunable quorum | Tunable (ONE to ALL), LWW conflicts |
| DynamoDB | Leaderless | Eventually consistent reads (strongly consistent optional) | Conditional writes for optimistic concurrency |
| CockroachDB | Consensus-based (Raft) | Synchronous (Raft consensus) | Serializable by default |

Key Takeaways

  • Single-leader replication is the simplest model: all writes go through one node, and followers replicate in the same order. Most relational databases use this model.
  • Multi-leader replication is useful for multi-datacenter deployments and offline-capable apps, but introduces write conflicts that must be resolved.
  • Leaderless replication avoids the need for failover and provides high availability, but requires quorum-based reads/writes and careful conflict handling.
  • Replication lag in asynchronous systems creates consistency anomalies. Understand the guarantees your application needs (read-your-writes, monotonic reads, consistent prefix) and choose your database and configuration accordingly.
  • Split-brain is the most dangerous failure in leader-based systems. Always have a mechanism (fencing, consensus) to prevent two nodes from acting as leader simultaneously.
