Partitioning & Sharding In Depth

When your dataset grows beyond what a single node can handle - either in storage capacity or query throughput - you need to break it into partitions (also called shards, regions, tablets, or vnodes depending on the system). Each partition is a small database of its own, and each piece of data belongs to exactly one partition. This chapter covers how to partition data effectively, handle secondary indexes, rebalance partitions as the cluster grows, and route queries to the right partition.

Why Partition?

The main goal of partitioning is scalability. By distributing data across many nodes, you can spread the query load: a query that touches only one partition can be executed independently by the node that owns that partition, so throughput (ideally) scales linearly with the number of nodes. Queries that span multiple partitions (scatter-gather queries), however, are much more expensive.

Partitioning is usually combined with replication, so each partition is stored on multiple nodes for fault tolerance. Each node may be the leader for some partitions and a follower for others.

Partitioning Strategies

Key-Range Partitioning

Assign a continuous range of keys to each partition, similar to volumes of an encyclopedia (A-D, E-H, ...). Within each partition, keys are stored in sorted order, which makes range scans efficient. This is used by HBase, Bigtable, and early versions of MongoDB.

The boundaries between ranges can be chosen manually or automatically by the database. The risk is hot spots: if the key is a timestamp, all writes for "today" go to the same partition while historical partitions sit idle.
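Key-range routing can be sketched as a sorted list of boundary keys plus a binary search; the boundary values below are purely illustrative:

```python
import bisect

# Hypothetical boundaries: partition 0 holds keys below "e", partition 1
# holds ["e", "i"), and so on, like volumes of an encyclopedia.
boundaries = ["e", "i", "n", "t"]  # 5 partitions

def range_partition(key: str) -> int:
    """Find the partition whose key range contains the given key."""
    return bisect.bisect_right(boundaries, key)

# Adjacent keys land on the same partition, so range scans stay local.
print(range_partition("apple"))   # partition 0
print(range_partition("banana"))  # partition 0
print(range_partition("orange"))  # partition 3
```

Because keys within each range stay sorted, a scan over `["apple", "banana"]` touches only partition 0, which is exactly what hash partitioning gives up.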

Hash Partitioning

Apply a hash function to the key and assign each partition a range of hash values. A good hash function distributes keys uniformly across partitions, eliminating hot spots caused by skewed key distributions. This is the default in Cassandra, DynamoDB, and Riak.

The tradeoff: you lose the ability to do efficient range queries. Adjacent keys (e.g., user_001 and user_002) may end up on completely different partitions, so a range query over a contiguous key space must query all partitions.

Hash Partitioning Example (Python)

```python
import hashlib

def get_partition(key: str, num_partitions: int) -> int:
    """Assign a key to a partition by hashing it modulo the partition count."""
    hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return hash_value % num_partitions

# Example: 4 partitions
keys = ["user:1001", "user:1002", "user:1003", "user:1004", "user:1005"]
for key in keys:
    partition = get_partition(key, 4)
    print(f"  {key} → partition {partition}")

# MD5 is deterministic, so each key always maps to the same partition,
# and a good hash spreads keys roughly evenly across the 4 partitions.
```

Compound Keys: The Compromise

Cassandra offers a compromise: the first part of the key (the partition key) is hashed to determine the partition, while the remaining columns (the clustering columns) determine the sort order within that partition. For example, on a social media platform, a key of (user_id, timestamp) would hash user_id to choose the partition, and within that partition, posts are sorted by timestamp. This allows efficient range queries over a single user's posts while distributing data evenly across partitions by user.
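A toy sketch of this compound-key scheme: hash `user_id` to pick the partition, keep rows sorted by `timestamp` within it. The in-memory structures below are illustrative stand-ins for a real storage engine:

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4
partitions = defaultdict(list)  # partition id -> sorted (user, ts, post) rows

def partition_for(user_id: str) -> int:
    # Only the partition key (user_id) is hashed.
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

def insert_post(user_id: str, timestamp: int, post: str) -> None:
    rows = partitions[partition_for(user_id)]
    rows.append((user_id, timestamp, post))
    rows.sort()  # real systems keep rows sorted via SSTables/B-trees

def posts_between(user_id: str, start: int, end: int) -> list:
    # A range query over one user's posts touches a single partition.
    return [post for uid, ts, post in partitions[partition_for(user_id)]
            if uid == user_id and start <= ts <= end]

insert_post("alice", 300, "third")
insert_post("alice", 100, "first")
insert_post("alice", 200, "second")
print(posts_between("alice", 100, 200))  # ['first', 'second']
```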

| Strategy | Key Distribution | Range Queries | Hot Spot Risk | Used By |
| --- | --- | --- | --- | --- |
| Key-Range | Contiguous ranges of sorted keys | Efficient (single partition for adjacent keys) | High (sequential keys cluster on one partition) | HBase, Bigtable, old MongoDB |
| Hash | Hash of key modulo number of partitions | Inefficient (must query all partitions) | Low (uniform distribution) | Cassandra, DynamoDB, Riak |
| Compound (Hash + Range) | Hash on partition key, sort by clustering key | Efficient within one partition key | Medium (depends on partition key cardinality) | Cassandra, DynamoDB (sort key) |

Hot Spots and Skew

Even with hash partitioning, hot spots can occur if a single key is extremely popular. This is sometimes called the celebrity problem: a celebrity with millions of followers posts an update, and all reads/writes for that key flood a single partition.

Hash partitioning does not help here because the hot spot is on a single key, not a range. Strategies to mitigate:

  • Key salting: Append a random number (0 to 99) to the hot key, spreading writes across 100 partitions. The reader must then read from all 100 salted keys and combine the results. This adds read complexity but distributes write load.
  • Application-level splitting: The application detects hot keys (e.g., from a list of known celebrity accounts) and adds special routing logic only for those keys.
  • Rate limiting and caching: Instead of solving the problem in the storage layer, cache the hot key's value in memory and serve reads from the cache.

Salting Keys: A Practical Example

Suppose user ID celebrity_123 is extremely popular. Instead of writing to one key, you write to celebrity_123:00, celebrity_123:01, ..., celebrity_123:99. This spreads the load across up to 100 different partitions. When reading, you query all 100 keys in parallel and merge the results. This is a targeted optimization - apply it only to known hot keys, not to all keys, otherwise you add unnecessary overhead to every query.
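The salting scheme above can be sketched as follows, with a plain dict standing in for the partitioned store:

```python
import random

NUM_SALTS = 100

def salted_write(store: dict, hot_key: str, value) -> None:
    """Append a random two-digit salt so writes fan out across partitions."""
    salt = random.randrange(NUM_SALTS)
    store.setdefault(f"{hot_key}:{salt:02d}", []).append(value)

def salted_read(store: dict, hot_key: str) -> list:
    """Read every salted variant and merge (in practice, in parallel)."""
    result = []
    for salt in range(NUM_SALTS):
        result.extend(store.get(f"{hot_key}:{salt:02d}", []))
    return result

store = {}
for i in range(1000):
    salted_write(store, "celebrity_123", i)
print(len(store))  # writes were spread across many salted keys
print(sorted(salted_read(store, "celebrity_123")) == list(range(1000)))  # True
```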

Secondary Indexes and Partitioning

Secondary indexes are the bread and butter of relational databases ("find all orders where customer_id = 42") and document databases ("find all products where color = red"). But secondary indexes do not map neatly to partitions. Two main approaches exist:

Document-Partitioned Index (Local Index)

Each partition maintains its own secondary index, covering only the documents in that partition. When you write, you only update the index on the partition that owns the document. However, when you read by a secondary index, you must query all partitions (scatter-gather), because the value you are searching for could be in any partition. This approach is used by MongoDB, Riak, Cassandra, Elasticsearch, SolrCloud, and VoltDB.

Scatter-gather queries are expensive and can have high tail latency (one slow partition slows the entire query). But writes are simple and fast.
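A toy illustration of scatter-gather over local indexes, with each partition's secondary index modeled as a dict and the fan-out done with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

# Each partition indexes only its own documents, so a secondary-key
# read must ask every partition. Terms and doc ids are illustrative.
partitions = [
    {"color:red": [1, 7], "color:blue": [3]},
    {"color:red": [12], "color:green": [9]},
    {"color:blue": [21], "color:red": [25]},
]

def scatter_gather(term: str) -> list:
    """Query all partitions in parallel, then merge the results."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda idx: idx.get(term, []), partitions)
    return sorted(doc for part in results for doc in part)

print(scatter_gather("color:red"))  # [1, 7, 12, 25]
```

Note that the total latency is governed by the slowest of the three lookups, which is why tail latency dominates scatter-gather reads.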

Term-Partitioned Index (Global Index)

A global index covers all data across all partitions, but the index itself is partitioned. For example, colors starting with a-m go to partition 0's index, and n-z go to partition 1's index. Reads are efficient (only query the partition(s) of the index that cover the term you are looking for), but writes are slower because a single document write may need to update indexes on multiple partitions. In practice, global secondary index updates are done asynchronously, so the index may not immediately reflect recent writes.

DynamoDB global secondary indexes use this approach. Amazon documents that they are updated within a fraction of a second under normal conditions, but there is no guarantee of consistency.
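A toy sketch of a term-partitioned index, splitting index entries by first letter (a-m vs n-z, as in the example above):

```python
index_partitions = [{}, {}]  # the global index is itself partitioned

def index_partition_for(term: str) -> int:
    # Terms a-m live in index partition 0, n-z in index partition 1.
    return 0 if term[0].lower() <= "m" else 1

def index_document(doc_id: int, terms: list) -> None:
    """One document write may touch several index partitions."""
    for term in terms:
        p = index_partition_for(term)
        index_partitions[p].setdefault(term, []).append(doc_id)

def lookup(term: str) -> list:
    """A read touches exactly one index partition."""
    return index_partitions[index_partition_for(term)].get(term, [])

index_document(1, ["red", "car"])   # "red" -> partition 1, "car" -> 0
index_document(2, ["red", "boat"])  # this one write hits both partitions
print(lookup("red"))  # [1, 2]
```

In a real system the two `index_document` updates would land on different nodes, which is exactly why they are applied asynchronously rather than in a distributed transaction.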

Local (Document-Partitioned) Index

  • Each partition has its own index covering only local data.
  • Writes are fast: update only the local partition's index.
  • Reads by secondary key require scatter-gather across all partitions.
  • High tail latency for reads (slowest partition determines total time).
  • Used by MongoDB, Elasticsearch, Cassandra.

Global (Term-Partitioned) Index

  • Single logical index covering all data, itself partitioned across nodes.
  • Reads are fast: only query relevant index partition(s).
  • Writes are slower: may need to update index partitions on other nodes.
  • Usually updated asynchronously, so reads may see stale index.
  • Used by DynamoDB GSI, Amazon Aurora.

Rebalancing Partitions

Over time, conditions change: query throughput increases, dataset size grows, a node fails and must be replaced. All of these require moving data between nodes, a process called rebalancing. The goals of rebalancing are:

  • After rebalancing, load should be spread fairly across the cluster.
  • During rebalancing, the database should continue serving reads and writes.
  • No more data than necessary should be moved to minimize network and disk I/O.

Why Not Hash Mod N?

A naive approach is hash(key) % num_nodes. The problem: when you add or remove a node, num_nodes changes, and most keys get reassigned to a different node. A key stays put only if its hash gives the same remainder for both node counts, so if you go from 10 to 11 nodes, roughly 90% of keys must be moved. This is unacceptable for large datasets.
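You can measure this churn directly. The snippet below counts how many of 10,000 synthetic keys get reassigned when the node count grows from 10 to 11:

```python
import hashlib

def node_for(key: str, num_nodes: int) -> int:
    # The naive scheme: hash the key, take it modulo the node count.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_nodes

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 90%
```

Only keys whose hash leaves the same remainder modulo 10 and modulo 11 stay put (about 1 in 11), so almost the entire dataset crosses the network.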

Fixed Number of Partitions

Create many more partitions than there are nodes (e.g., 1,000 partitions across 10 nodes, so each node has ~100 partitions). When a node is added, it steals a few partitions from every existing node. When a node is removed, its partitions are distributed among remaining nodes. The partitions themselves never change - only their assignment to nodes changes. Used by Riak, Elasticsearch, Couchbase, and Voldemort.
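A minimal sketch of this scheme: 1,000 fixed partitions assigned round-robin to four nodes, with a fifth node stealing partitions from over-quota nodes until shares are even. The assignment logic here is illustrative, not any particular system's algorithm:

```python
NUM_PARTITIONS = 1000

def round_robin(nodes: list) -> dict:
    """Initial assignment of the fixed partitions to nodes."""
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def add_node(assignment: dict, new_node: str) -> dict:
    """The new node steals partitions until it holds a fair share."""
    num_nodes = len(set(assignment.values())) + 1
    target = len(assignment) // num_nodes
    counts = {new_node: 0}
    for node in assignment.values():
        counts[node] = counts.get(node, 0) + 1
    new = dict(assignment)
    for p, node in assignment.items():
        if counts[new_node] >= target:
            break
        if counts[node] > target:  # steal only from over-quota nodes
            new[p] = new_node
            counts[node] -= 1
            counts[new_node] += 1
    return new

before = round_robin(["A", "B", "C", "D"])
after = add_node(before, "E")
moved = sum(1 for p in before if before[p] != after[p])
print(moved)  # 200: only a fifth of the data moves, vs ~800 with mod-N
```

The partition boundaries never change; only the partition-to-node map does, which is what keeps the data movement proportional to the imbalance rather than to the dataset size.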

The challenge: you must choose the number of partitions upfront. Too few partitions limit your ability to scale; too many add management overhead. The right number depends on your predicted maximum dataset size.

Dynamic Partitioning

Start with a small number of partitions and split them as they grow. When a partition exceeds a configured size threshold, split it into two. When it shrinks below a threshold, merge it with an adjacent partition. This is analogous to how B-trees split and merge pages. Used by HBase, RethinkDB, and MongoDB.

The advantage: the number of partitions adapts to the data volume. The number of partitions is proportional to the dataset size, so each partition stays within a manageable size.
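A minimal sketch of dynamic splitting, with each partition modeled as a sorted key list that splits in half once it exceeds a (deliberately tiny) threshold:

```python
import bisect

SPLIT_THRESHOLD = 4
partitions = [[]]   # ordered list of partitions, each a sorted list of keys
split_points = []   # split_points[i] = lowest key of partitions[i + 1]

def insert(key: str) -> None:
    i = bisect.bisect_right(split_points, key)   # key-range routing
    bisect.insort(partitions[i], key)
    if len(partitions[i]) > SPLIT_THRESHOLD:     # split, like a B-tree page
        mid = len(partitions[i]) // 2
        left, right = partitions[i][:mid], partitions[i][mid:]
        partitions[i:i + 1] = [left, right]
        split_points.insert(i, right[0])

for k in "qwertyuiopasdfgh":
    insert(k)
print(len(partitions))  # started at 1, grew with the data volume
```

Merging under-full partitions would be the mirror image: remove a split point and concatenate the two neighbors.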

Proportional to Nodes (Consistent Hashing)

Keep the number of partitions proportional to the number of nodes: when a new node joins, it randomly splits a fixed number of existing partitions and takes ownership of half of each split. Used by Cassandra. This approach ensures that each node has a roughly fixed number of partitions regardless of cluster size.

| Rebalancing Strategy | Partition Count | When to Use | Used By |
| --- | --- | --- | --- |
| Fixed partitions | Set at creation time (e.g., 256 or 1024) | Dataset size is predictable; want simple operations | Riak, Elasticsearch, Couchbase |
| Dynamic splitting | Grows/shrinks with data volume | Dataset size is unpredictable or varies widely | HBase, MongoDB, RethinkDB |
| Proportional to nodes | Fixed per node (e.g., 256 per node) | Cluster size changes frequently | Cassandra |

Request Routing

When a client wants to read or write a key, how does it know which node to contact? This is an instance of a general problem called service discovery. Three common approaches:

  • Client contacts any node: The node checks whether it owns the requested partition. If not, it forwards the request to the correct node and relays the response. This is used by Cassandra and Riak (using a gossip protocol to disseminate partition-to-node mappings among all nodes).
  • Routing tier: A dedicated partition-aware load balancer sits between clients and the cluster. It maintains the partition map and routes each request to the correct node. This adds a network hop but simplifies clients. Used by MongoDB's mongos router.
  • Partition-aware client: The client itself maintains the partition map and contacts the correct node directly. This is the fastest approach but requires the client to keep its partition map up to date. Used by clients that talk to ZooKeeper or etcd to learn the current partition assignments (e.g., Kafka consumers).
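The first approach (contact any node, forward if necessary) can be sketched as follows; the node names and partition map are illustrative:

```python
# Every node knows the full partition map (spread by gossip in
# Cassandra/Riak) and forwards requests for partitions it does not own.
partition_map = {p: f"node-{p % 3}" for p in range(8)}

def handle_request(receiving_node: str, partition: int) -> str:
    owner = partition_map[partition]
    if owner == receiving_node:
        return f"{owner} served partition {partition} locally"
    # Not the owner: forward to the right node and relay its response.
    return handle_request(owner, partition)

print(handle_request("node-0", 4))  # node-1 owns P4, so the call is forwarded
```

The client never needs to know the topology, at the cost of an extra network hop whenever it guesses wrong.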

ZooKeeper and Partition Assignment

Many distributed systems use a coordination service like Apache ZooKeeper (or etcd, or Consul) to track partition-to-node assignments. When a partition is reassigned to a different node, ZooKeeper is updated, and all clients that subscribe to this information are notified. This ensures routing information is always up to date without relying on gossip convergence. HBase, Kafka, and SolrCloud all rely on ZooKeeper for this purpose.

Partitioning and Replication Together

In practice, partitioning and replication are used together. Each partition is replicated to multiple nodes. A common setup is to have each node be the leader for some partitions and a follower for others. For example, in a 3-node cluster with a replication factor of 3 and 6 partitions:

Partition + Replication Layout

```
Node A: Leader(P1, P2)  Follower(P3, P4, P5, P6)
Node B: Leader(P3, P4)  Follower(P1, P2, P5, P6)
Node C: Leader(P5, P6)  Follower(P1, P2, P3, P4)

Each partition has 1 leader + 2 followers = replication factor 3
Each node is leader for 2 partitions, follower for 4
Writes to P1 go to Node A (leader), replicated to B and C
```

Key Takeaways

  • Key-range partitioning enables efficient range queries but risks hot spots. Hash partitioning distributes load evenly but loses range query ability. Compound keys offer a middle ground.
  • Secondary indexes on partitioned data must be either local (fast writes, scatter-gather reads) or global (efficient reads, slow async writes).
  • Never use hash(key) % n for partitioning because adding/removing nodes causes massive data movement. Use fixed partitions, dynamic splitting, or consistent hashing.
  • Hot spots from popular keys (the celebrity problem) require application-level workarounds like key salting or caching.
  • Request routing can be handled by any-node forwarding, a dedicated routing tier, or partition-aware clients, each with different latency and complexity trade-offs.

Chapter Check-Up

Quick quiz to reinforce what you just learned.