Why Consensus Matters
Distributed systems consist of multiple processes communicating over an unreliable network. Several critical tasks require these processes to reach agreement on a shared decision:
- Leader Election: When a database primary fails, the remaining replicas must agree on exactly one new leader. If two nodes both believe they are the leader (a "split brain"), writes can diverge and data is corrupted.
- Atomic Commit: In a distributed transaction spanning multiple partitions, all partitions must agree to either commit or abort. A partial commit leaves the database in an inconsistent state.
- Total Order Broadcast: If every node must process messages in the same order (e.g., replicated state machines), the nodes need consensus on the sequence. This is equivalent to repeated rounds of consensus.
- Distributed Locking: Coordinating exclusive access to a shared resource (e.g., a file, a row, a task) requires all participants to agree on who currently holds the lock.
- Configuration Management: When cluster membership changes (a node joins or leaves), every node must agree on the current set of members to route traffic correctly.
Without consensus, any of these operations can lead to split brains, data loss, or silent corruption. Consensus is therefore the foundation on which reliable distributed infrastructure is built.
The FLP Impossibility Result
The FLP result, proved by Fischer, Lynch, and Paterson in 1985, states that in a purely asynchronous system where even a single process can crash, no deterministic algorithm can guarantee that consensus is reached in bounded time. More precisely: no protocol can guarantee both safety (agreement and validity) and liveness (termination) in an asynchronous model with even one faulty process.
This does not mean consensus is impossible in practice. It means that any practical algorithm must make one of these concessions:
- Use timeouts to detect failures, introducing a synchrony assumption (Raft, Paxos).
- Use randomization to break symmetry, achieving probabilistic termination (Ben-Or algorithm).
- Weaken the guarantee to eventual convergence rather than immediate agreement (CRDTs, gossip protocols).
FLP is the reason every consensus algorithm you encounter uses heartbeats and election timeouts: they deliberately step outside the purely asynchronous model to make progress possible.
The Raft Consensus Algorithm
Raft was designed by Diego Ongaro and John Ousterhout in 2014 with one explicit goal: understandability. It provides the same guarantees as Paxos but decomposes the problem into three sub-problems: leader election, log replication, and safety.
Node States
Every node in a Raft cluster is in exactly one of three states at any given time:
- Follower: Passive. Responds to RPCs from the leader and candidates. This is the starting state for all nodes.
- Candidate: Actively seeking to become leader. A follower transitions to candidate if it does not receive a heartbeat within a randomized election timeout.
- Leader: Handles all client requests and replicates log entries to followers. There is at most one leader per term.
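The three states and the events that move a node between them can be sketched as a small transition table. This is an illustrative model, not a full implementation; the event names here are descriptive labels, not Raft RPC names.

```python
from enum import Enum

class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"

# Legal Raft transitions: followers time out and become candidates; candidates
# win an election (leader), discover a current leader or higher term (follower),
# or time out again (new election); leaders step down on seeing a higher term.
TRANSITIONS = {
    (State.FOLLOWER, "election_timeout"): State.CANDIDATE,
    (State.CANDIDATE, "won_majority"): State.LEADER,
    (State.CANDIDATE, "saw_leader_or_higher_term"): State.FOLLOWER,
    (State.CANDIDATE, "election_timeout"): State.CANDIDATE,
    (State.LEADER, "saw_higher_term"): State.FOLLOWER,
}

def step(state, event):
    """Return the next state; raises KeyError for an illegal transition."""
    return TRANSITIONS[(state, event)]
```

Note that there is no transition from Leader back to Candidate: a deposed leader always returns to Follower first.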
Leader Election
Raft divides time into terms, which are numbered with consecutive integers. Each term begins with an election. Here is how a leader is elected:
1. Election timeout fires: A follower has not heard from a leader for a randomized duration (typically 150-300ms). It increments its current term, transitions to Candidate, and votes for itself.
2. RequestVote RPCs: The candidate sends RequestVote RPCs to all other nodes, including its term number and the index/term of its last log entry.
3. Voting rules: A node grants its vote if (a) it has not already voted in this term, and (b) the candidate's log is at least as up-to-date as its own. Each node votes for at most one candidate per term.
4. Majority wins: If the candidate receives votes from a majority of nodes (including itself), it becomes the leader for that term. It immediately sends heartbeat AppendEntries RPCs to establish authority and prevent new elections.
5. Split vote handling: If no candidate gets a majority (e.g., two candidates split the vote), the term ends with no leader. Each candidate waits a new randomized timeout and tries again. The randomization makes indefinite split votes extremely unlikely.
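The voting rules above can be sketched as a single function. This is a simplified model (node state as a plain dict, no persistence or RPC plumbing), but the two checks — one vote per term, and the "at least as up-to-date" log comparison from the Raft paper — are the real ones.

```python
def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """Raft compares last entries: higher term wins; equal terms, longer log wins."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

def grant_vote(node, term, candidate_id, cand_last_index, cand_last_term):
    """node: dict with current_term, voted_for, last_log_term, last_log_index.
    Mutates node and returns True if the vote is granted."""
    if term < node["current_term"]:
        return False                      # stale candidate: reject
    if term > node["current_term"]:
        node["current_term"] = term       # newer term: adopt it, reset our vote
        node["voted_for"] = None
    if node["voted_for"] not in (None, candidate_id):
        return False                      # already voted for someone else this term
    if not log_is_up_to_date(cand_last_term, cand_last_index,
                             node["last_log_term"], node["last_log_index"]):
        return False                      # candidate's log is behind ours
    node["voted_for"] = candidate_id
    return True
```

The up-to-date check is what enforces Leader Completeness: a node whose log contains committed entries will refuse to vote for any candidate missing them.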
Log Replication
Once a leader is elected, it handles all client requests. Each client write becomes a new entry in the leader's log:
- The leader appends the new entry to its local log.
- It sends AppendEntries RPCs to all followers in parallel, containing the new log entry along with the leader's term and the index/term of the preceding entry.
- Each follower checks that the preceding entry matches its own log. If it does, the follower appends the new entry and responds with success. If it does not, the follower rejects the RPC, and the leader decrements that follower's next index and retries.
- Once the leader receives acknowledgment from a majority of nodes, the entry is considered committed. The leader applies it to its state machine and responds to the client.
- Followers learn about committed entries via subsequent AppendEntries RPCs and apply them to their own state machines.
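The follower-side consistency check can be sketched in a few lines. This is a toy model: the log is a Python list of (term, command) pairs, indexed from 1 conceptually, and network details are omitted.

```python
def append_entries(log, prev_index, prev_term, entries):
    """Follower-side handler for AppendEntries.
    log: list of (term, command) pairs; entry i lives at log[i - 1].
    Returns True (and mutates log) on success, False if the check fails."""
    # Consistency check: the entry preceding the new ones must match.
    if prev_index > 0:
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False
    # Truncate any conflicting suffix, then append the leader's entries.
    del log[prev_index:]
    log.extend(entries)
    return True
```

A False return is what drives the leader to decrement that follower's next index and retry with an earlier prefix, walking back until the logs agree.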
Safety Properties
Raft guarantees several safety properties that together ensure consistency:
- Election Safety: At most one leader per term. This follows from the majority vote requirement and the rule that each node votes at most once per term.
- Leader Append-Only: A leader never overwrites or deletes entries in its log. It only appends new entries.
- Log Matching: If two logs contain an entry with the same index and term, then the logs are identical for all entries up through that index.
- Leader Completeness: If a log entry is committed in a given term, it will be present in the logs of all leaders for all higher terms. This is the key safety property that prevents committed data from being lost.
- State Machine Safety: If a server applies a log entry at a given index, no other server will ever apply a different entry at that index.
Paxos
Paxos, first described by Leslie Lamport in 1989 and finally published in 1998 as the famous "The Part-Time Parliament" paper, is the original consensus algorithm. It is proven correct but notoriously difficult to understand, implement, and extend.
Single-Decree Paxos
The basic Paxos protocol reaches agreement on a single value through two phases:
- Phase 1 (Prepare): A proposer selects a proposal number n and sends a Prepare(n) message to a majority of acceptors. Each acceptor promises not to accept any proposal with a number less than n. If the acceptor has already accepted a value, it includes that value in its response.
- Phase 2 (Accept): If the proposer receives promises from a majority, it sends an Accept(n, v) message, where v is either the value from the highest-numbered previously accepted proposal, or the proposer's own value if no prior proposals exist. A majority of acceptors accepting this message constitutes consensus.
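The acceptor's side of both phases is small enough to sketch directly. This is a single-acceptor model with in-memory state; a real acceptor must persist its promises to disk before replying.

```python
class Acceptor:
    """Paxos acceptor: tracks its promise and its highest accepted proposal."""
    def __init__(self):
        self.promised_n = -1    # highest proposal number promised
        self.accepted_n = -1    # proposal number of the accepted value, if any
        self.accepted_v = None  # the accepted value itself

    def prepare(self, n):
        """Phase 1: promise to ignore proposals numbered below n.
        Returns (ok, previously_accepted_n, previously_accepted_v)."""
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_v
        return False, None, None

    def accept(self, n, v):
        """Phase 2: accept (n, v) unless a higher-numbered promise was made."""
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_v = v
            return True
        return False
```

Returning the previously accepted value from prepare() is the crux of Paxos safety: it forces a later proposer to re-propose any value that may already have been chosen.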
Multi-Paxos
Single-decree Paxos decides one value. Real systems need to decide a sequence of values (a replicated log). Multi-Paxos extends the protocol by:
- Running a separate Paxos instance for each log slot.
- Electing a stable leader to skip Phase 1 for consecutive proposals, reducing message complexity from 4 messages to 2 per decision.
- This is essentially what Raft formalizes: Multi-Paxos with a specific leader election mechanism and log management.
Used by: Google Chubby (lock service), Google Spanner (Multi-Paxos across data centers), Apache Mesos.
ZooKeeper & etcd: Coordination Services
Rather than implementing consensus directly, most applications delegate coordination to a dedicated service. The two dominant coordination services are ZooKeeper and etcd.
ZooKeeper
ZooKeeper provides a hierarchical key-value store (similar to a file system) with strong consistency guarantees. It uses ZAB (ZooKeeper Atomic Broadcast), a protocol similar to Raft, internally.
- Data model: Znodes organized in a tree. Each znode can hold data (up to 1MB) and have children. Znodes can be persistent or ephemeral (automatically deleted when the creating session ends).
- Watches: Clients can set watches on znodes to receive notifications when data changes. This enables reactive coordination patterns.
- Sequential znodes: ZooKeeper can create znodes with monotonically increasing sequence numbers, enabling distributed queue and lock implementations.
- Use cases: Leader election, distributed locking, configuration management, service discovery, group membership.
Used by: Apache Kafka (older versions), Apache HBase, Apache Hadoop, Apache Solr.
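ZooKeeper's lock recipe builds directly on sequential znodes: each client creates an ephemeral sequential znode under a lock path, holds the lock iff its znode has the lowest sequence number, and otherwise watches only the znode immediately below its own (to avoid a thundering herd). The decision logic can be sketched without a live cluster; the `lock-NNNNNNNNNN` naming is the convention ZooKeeper uses for sequential nodes, and the client plumbing is omitted.

```python
def _seq(name):
    """Extract the sequence number from a name like 'lock-0000000033'."""
    return int(name.rsplit("-", 1)[1])

def holds_lock(my_node, all_nodes):
    """A client holds the lock iff its znode has the lowest sequence number."""
    return _seq(my_node) == min(_seq(n) for n in all_nodes)

def node_to_watch(my_node, all_nodes):
    """Watch only the next-lowest znode, so one release wakes one waiter."""
    below = [n for n in all_nodes if _seq(n) < _seq(my_node)]
    return max(below, key=_seq) if below else None
```

Because the znodes are ephemeral, a crashed client's znode disappears with its session and the next waiter is notified automatically.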
etcd
etcd is a distributed key-value store built on Raft. It is simpler than ZooKeeper, with a flat key space and a gRPC API.
- Data model: Flat key-value pairs with versioning. Every key mutation increments a global revision number, enabling efficient watch streams.
- Lease mechanism: Keys can be associated with leases (TTLs). When a lease expires, all associated keys are deleted. This provides the same ephemeral node semantics as ZooKeeper.
- Transactions: etcd supports mini-transactions of the form if (compare) then (operations) else (operations), enabling atomic check-and-set operations.
- Watch API: Clients can watch key ranges and receive ordered streams of events, enabling real-time coordination.
Used by: Kubernetes (stores all cluster state in etcd), CoreDNS, Vitess, CockroachDB.
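The revision counter and the if/then/else transaction can be modeled in a few lines of plain Python. This is a toy in-memory model of the semantics, not the etcd client API: `MiniKV` and its methods are illustrative names, and real etcd compares against revisions or versions as well as values.

```python
class MiniKV:
    """Toy model of etcd's revisioned store and if/then/else transaction."""
    def __init__(self):
        self.data = {}      # key -> (value, mod_revision)
        self.revision = 0   # global revision, bumped on every mutation

    def put(self, key, value):
        self.revision += 1
        self.data[key] = (value, self.revision)

    def txn(self, key, expected_value, then_ops, else_ops):
        """If key currently holds expected_value, apply then_ops, else else_ops.
        All writes in one transaction land under a single new revision."""
        current = self.data.get(key, (None, 0))[0]
        succeeded = current == expected_value
        ops = then_ops if succeeded else else_ops
        if ops:
            self.revision += 1
            for k, v in ops:
                self.data[k] = (v, self.revision)
        return succeeded
```

The check-and-set shape is exactly what a lease-based lock acquisition uses: "if the lock key is empty, write my ID, else do nothing".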
Linearizability vs. Serializability
These two terms are frequently confused, even in professional literature. They describe different things and apply to different contexts.
Linearizability
- A recency guarantee on individual reads and writes to a single object.
- Once a write completes, all subsequent reads must return that value (or a later one).
- The system behaves as if there is a single copy of the data and all operations are atomic.
- Applies to: distributed registers, consensus protocols, coordination services.
- Cost: high latency due to coordination. Incompatible with high availability during partitions (CAP theorem).
Serializability
- An isolation guarantee on transactions involving multiple objects.
- The result of executing transactions concurrently is equivalent to some serial (one-at-a-time) execution order.
- Does not require recency: the equivalent serial order may differ from real-time order.
- Applies to: database transactions (SQL databases with SERIALIZABLE isolation level).
- Cost: reduced concurrency, potential for aborts and retries.
Strict serializability (what Spanner calls "external consistency") combines both: transactions execute as if serial, and the serial order is consistent with real-time order. Spanner provides this guarantee; CockroachDB provides serializability plus per-key linearizability, which falls just short of it. Strict serializability is the strongest consistency guarantee available.
Fencing Tokens & Distributed Locks
A common pattern in distributed systems is using a lock to ensure exclusive access to a resource. However, distributed locks are trickier than they appear due to the possibility of process pauses (GC pauses, network delays, context switches).
The Problem
- Client A acquires lock with TTL of 30 seconds.
- Client A enters a long GC pause (or is slow for any reason).
- The lock expires after 30 seconds.
- Client B acquires the lock and begins writing to the shared resource.
- Client A wakes up from its pause, believing it still holds the lock, and also writes to the resource.
- Both clients have written concurrently. Data corruption ensues.
The Solution: Fencing Tokens
Every time a lock is granted, the lock service issues a fencing token: a monotonically increasing number. The resource server checks the token on every request:
- Client A acquires lock with fencing token 33.
- Client A pauses. Lock expires.
- Client B acquires lock with fencing token 34.
- Client B writes to the resource, which records token 34.
- Client A wakes up and tries to write with token 33. The resource server rejects the write because 33 < 34 (the highest token it has seen).
This pattern requires the resource server to participate in the protocol. ZooKeeper's sequential znodes and etcd's revision numbers both serve as natural fencing tokens.
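The resource server's side of the protocol is a single comparison. This sketch replays the scenario above; the class and method names are illustrative.

```python
class FencedResource:
    """Resource server that rejects writes carrying a stale fencing token."""
    def __init__(self):
        self.highest_token = -1
        self.contents = None

    def write(self, token, data):
        if token < self.highest_token:
            return False                # stale client: reject the write
        self.highest_token = token      # remember the newest token seen
        self.contents = data
        return True
```

The key point: the check lives on the resource server, not the lock service, so even a client that is arbitrarily delayed cannot clobber a newer holder's writes.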
Practical Considerations
Cluster Size
Consensus requires a majority to make progress. A cluster of 2f + 1 nodes can tolerate f failures:
| Cluster Size | Majority Needed | Failures Tolerated | Typical Use |
|---|---|---|---|
| 3 | 2 | 1 | Small deployments, dev/staging |
| 5 | 3 | 2 | Production (most common) |
| 7 | 4 | 3 | Cross-region deployments |
Larger clusters tolerate more failures but increase the latency of commits (more nodes must acknowledge). Five nodes is the sweet spot for most production deployments.
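The quorum arithmetic behind the table is two one-line functions. Note what happens at even sizes: a 4-node cluster needs 3 acknowledgments but still tolerates only 1 failure, which is why odd cluster sizes are the norm.

```python
def majority(cluster_size):
    """Smallest number of nodes that constitutes a quorum."""
    return cluster_size // 2 + 1

def failures_tolerated(cluster_size):
    """Largest f such that cluster_size >= 2f + 1."""
    return (cluster_size - 1) // 2
```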
Performance Implications
Consensus is fundamentally limited by network latency. Every committed write requires at least one round trip to a majority of nodes. In a single data center, this adds 0.5-2ms. Across regions, it can add 50-200ms. Strategies to mitigate this include:
- Leader locality: Place the leader in the region where most writes originate.
- Read leases: Allow followers to serve reads locally for a bounded time, reducing read latency at the cost of slight staleness.
- Batching: Amortize the cost of consensus by batching multiple client requests into a single log entry.
- Pipelining: Send the next AppendEntries before the previous one is acknowledged, increasing throughput.
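Batching in particular is simple to picture: the leader drains its pending queue into chunks, and each chunk becomes one log entry, so one consensus round trip commits many client requests. A minimal sketch (the function name and chunk size are illustrative):

```python
def batch_requests(pending, max_batch):
    """Group pending client requests into consensus rounds of at most max_batch.
    Each inner list would be committed as a single log entry, amortizing
    the round trip to a majority across the whole batch."""
    return [pending[i:i + max_batch] for i in range(0, len(pending), max_batch)]
```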
Key Takeaways
- Consensus solves leader election, atomic commit, total ordering, and distributed locking. Without it, distributed systems are vulnerable to split brains and data corruption.
- The FLP result proves that no deterministic algorithm can guarantee consensus in an asynchronous system with even one crash. Practical algorithms use timeouts to circumvent this.
- Raft decomposes consensus into leader election, log replication, and safety. It requires a majority of nodes to commit entries, tolerating up to f failures in a 2f+1 cluster.
- Paxos is the theoretical foundation; Multi-Paxos with a stable leader is essentially what Raft formalizes with better understandability.
- ZooKeeper and etcd are production-grade coordination services. Use them for metadata and coordination, never for bulk application data.
- Linearizability is about recency of individual operations; serializability is about isolation of transactions. Strict serializability combines both.
- Distributed locks must use fencing tokens to prevent stale clients from corrupting shared state. Prefer consensus-based lock services over cache-based ones.