Consistent Hashing

Consistent hashing is a technique for distributing data across a dynamic set of nodes in a way that minimizes redistribution when nodes are added or removed. It is fundamental to distributed caches, databases, and load balancers.

The Problem with Simple Hashing

With simple hash-based distribution, you assign a key to a server using server = hash(key) % N, where N is the number of servers. This works well until N changes.

If you add or remove a server (N changes), nearly every key maps to a different server. This causes a massive cache miss storm: all cached data effectively becomes invalid, and every request hits the database simultaneously.

Example
With 4 servers and 1 million keys, changing to 5 servers remaps approximately 80% of all keys. That is 800,000 cache misses all at once. For a high-traffic system, this can cause a cascading failure.

The Hash Ring

Consistent hashing maps both keys and nodes onto a circular hash space (a "ring" from 0 to 2^32 - 1). Each node is placed on the ring at the position determined by hashing its identifier. Each key is placed on the ring by hashing the key, then walking clockwise to find the first node: that is the node responsible for the key.

Consistent Hash Ring
A B C k1 --> A k2 --> C k3 --> A k4 --> B Server nodes Keys (walk clockwise to find node) clockwise

Adding or Removing a Node

When a new node is added to the ring, only the keys between the new node and its predecessor (counter-clockwise neighbor) are reassigned. All other keys remain on their current nodes.

When a node is removed, only its keys move to the next clockwise node. The rest of the ring is unaffected.

Impact
With N nodes, adding or removing one node only remaps approximately 1/N of the keys. With 10 nodes, only ~10% of keys are affected. Compare this to simple hashing, where ~90% would be remapped.

Virtual Nodes (Vnodes)

With just a few physical nodes, the distribution can be uneven: one node might get a disproportionately large arc of the ring. Virtual nodes solve this.

Instead of mapping each physical node to one position on the ring, you create multiple "virtual" copies (e.g., 100-200 per physical node). Each virtual node maps to a different position on the ring but all point back to the same physical machine.

Benefits of Virtual Nodes

  • Even distribution: With many points on the ring, the arc lengths become approximately equal. Data is distributed more uniformly.
  • Smoother rebalancing: When a physical node is added, its virtual copies are spread across the ring, absorbing keys from multiple other nodes instead of just one.
  • Heterogeneous hardware: A more powerful server can be assigned more virtual nodes, proportionally receiving more data.

Real-World Usage

SystemHow It Uses Consistent Hashing
Amazon DynamoDBPartitions data across nodes. Each key is assigned to a node using consistent hashing with virtual nodes.
Apache CassandraDistributes data using a token ring (consistent hash ring). Each node is responsible for a range of tokens.
Memcached / Redis ClusterDistributes cache keys across cache nodes. Adding a node only invalidates ~1/N of the cache.
Load BalancersSome LBs use consistent hashing to route requests to the same backend for session affinity.
Akamai CDNOriginally developed consistent hashing to map web content to edge servers.

Implementation Sketch

Pseudocodeclass ConsistentHash: ring = sorted_map<int, string> # hash_value -> node_id replicas = 150 # virtual nodes per physical node add_node(node_id): for i in 0..replicas: hash_val = hash(node_id + ":" + i) ring[hash_val] = node_id remove_node(node_id): for i in 0..replicas: hash_val = hash(node_id + ":" + i) ring.remove(hash_val) get_node(key): hash_val = hash(key) # Find first ring position >= hash_val (clockwise) node_id = ring.ceiling(hash_val) if node_id is None: node_id = ring.first() # Wrap around return node_id

Key Takeaways

  • Simple hash-mod distribution causes massive data redistribution when nodes change. Consistent hashing limits redistribution to ~1/N of keys.
  • The hash ring maps both keys and nodes to the same circular space. Keys walk clockwise to find their node.
  • Virtual nodes are essential for even distribution and smooth rebalancing. Use 100-200 virtual nodes per physical node.
  • Consistent hashing is used in DynamoDB, Cassandra, CDNs, distributed caches, and load balancers.
  • Understand this concept thoroughly: it appears in nearly every system design interview involving distributed systems.

Chapter Check-Up

Quick quiz to reinforce what you just learned.

๐Ÿงช

Practice What You Learned

Experiment with NoSQL patterns and observe how data distribution affects performance.

Start Guided Lab โ†’