Requirements for Distributed IDs
Before choosing an ID generation strategy, clarify what properties your system needs:
- Uniqueness: No two records should ever share the same ID, even across multiple servers, data centers, or time zones. This is the non-negotiable requirement.
- Sortability: IDs that are roughly time-ordered allow efficient range queries, make database indexes more cache-friendly, and let you sort by creation time without a separate column.
- Scalability: ID generation must not become a bottleneck. Each application server should be able to generate IDs independently without coordinating with a central authority on every request.
- Low latency: ID generation should add negligible overhead. Ideally, it should be a local computation, not a network call.
- Compactness: Shorter IDs use less storage and are faster to index. A 64-bit integer is far more efficient than a 128-bit UUID in B-tree indexes.
Approach 1: UUID v4 (Random)
A UUID (Universally Unique Identifier) version 4 is a 128-bit randomly generated value, typically represented as a 36-character string like 550e8400-e29b-41d4-a716-446655440000.
How It Works
Generate 122 random bits (6 bits are reserved for version and variant fields). The probability of collision is astronomically low: you would need to generate about 2.7 × 1018 UUIDs to have a 50% chance of a single collision.
// UUID v4 generation (pseudocode)
function generateUUIDv4():
bytes = random(16) // 16 random bytes = 128 bits
bytes[6] = (bytes[6] & 0x0F) | 0x40 // version 4
bytes[8] = (bytes[8] & 0x3F) | 0x80 // variant 1
return formatAsHex(bytes)
// Example: "f47ac10b-58cc-4372-a567-0e02b2c3d479"
Pros
- No coordination needed. Any server, anywhere, can generate a UUID independently.
- Extremely low collision probability without any shared state.
- Widely supported across all programming languages and databases.
Cons
- Not sortable by time. Random UUIDs scatter inserts across the entire B-tree index, causing poor write performance in databases (especially InnoDB which clusters by primary key).
- Large size: 128 bits (16 bytes) as binary, 36 characters as string. Significantly larger than a 64-bit integer.
- Poor index locality: Random distribution means every insert touches a different page in the B-tree, leading to excessive disk I/O.
Approach 2: Database Auto-Increment
The simplest approach: let the database assign a sequential integer ID using AUTO_INCREMENT (MySQL) or SERIAL / IDENTITY (PostgreSQL).
Why It Breaks in Distributed Systems
- Single point of failure: The database is the sole ID generator. If it goes down, no IDs can be created.
- Scalability ceiling: All inserts must go through one database (or one master in a primary-replica setup) to maintain sequential ordering.
- Merging is difficult: If you have two database shards each using auto-increment, they will produce conflicting IDs (both will have ID 1, 2, 3...).
Mitigation: Offset and Stride
Use different starting offsets and step sizes per shard. For 3 shards:
Shard 1: 1, 4, 7, 10, 13, ... (start=1, step=3)
Shard 2: 2, 5, 8, 11, 14, ... (start=2, step=3)
Shard 3: 3, 6, 9, 12, 15, ... (start=3, step=3)
This ensures uniqueness but IDs are no longer globally sorted by time. Adding or removing shards is also painful because you must change the step size across all shards.
Approach 3: Twitter Snowflake
Twitter's Snowflake is the most widely referenced distributed ID generator. It produces 64-bit IDs that are roughly time-sorted, unique across data centers, and generated without coordination between machines.
Bit Layout
- 1 bit: Sign bit, always 0 (ensures positive numbers).
- 41 bits: Millisecond timestamp since a custom epoch. 241 milliseconds ≈ 69 years from the epoch before overflow.
- 5 bits: Datacenter ID (0-31), allowing up to 32 data centers.
- 5 bits: Machine/worker ID (0-31), allowing 32 machines per data center.
- 12 bits: Sequence number (0-4095), allowing 4,096 IDs per millisecond per machine.
Throughput
Each machine can generate 4,096 IDs per millisecond = 4,096,000 IDs per second. With 32 data centers × 32 machines = 1,024 machines, the system can generate over 4 billion IDs per second globally.
// Snowflake ID generation (pseudocode)
class SnowflakeGenerator:
epoch = 1288834974657 // Twitter custom epoch (Nov 4, 2010)
datacenter_id // 5 bits, assigned at deployment
machine_id // 5 bits, assigned at deployment
sequence = 0
last_timestamp = -1
function generate():
timestamp = currentTimeMillis() - epoch
if timestamp == last_timestamp:
sequence = (sequence + 1) & 0xFFF // 12-bit mask
if sequence == 0:
timestamp = waitForNextMillis() // Block until next ms
else:
sequence = 0
last_timestamp = timestamp
id = (timestamp << 22)
| (datacenter_id << 17)
| (machine_id << 12)
| sequence
return id
Pros
- 64-bit: fits in a database
BIGINT, twice as compact as UUID. - Time-sorted: IDs increase over time, so B-tree inserts are always at the end (excellent write performance).
- No coordination: each machine generates IDs independently using only a local clock.
- Extractable metadata: you can decode the timestamp, data center, and machine from any ID.
Cons
- Clock dependency: If a machine's clock goes backward (NTP adjustment), duplicate IDs could be generated. Snowflake guards against this by refusing to generate IDs when the clock moves backward.
- Machine ID assignment: The datacenter and machine IDs must be globally unique and pre-assigned, typically via configuration management (ZooKeeper, etcd, or environment variables).
- Not cryptographically random: IDs are predictable and leak information about creation time and infrastructure topology.
Approach 4: ULID (Universally Unique Lexicographically Sortable Identifier)
ULID combines the best of UUID and Snowflake. It is a 128-bit identifier that is lexicographically sortable and encoded as a 26-character Crockford Base32 string.
ULID structure:
Timestamp: 48 bits (millisecond precision, ~8,900 years)
Randomness: 80 bits (cryptographically random)
Example: 01ARZ3NDEKTSV4RRFFQ69G5FAV
|----------|----------------|
timestamp randomness
(10 chars) (16 chars)
Key properties:
- Lexicographically sortable (string comparison works)
- Compatible with UUID format (128 bits)
- No machine ID coordination needed
- Monotonic within same millisecond (increment random part)
Pros
- Sortable by time using simple string comparison.
- No coordination between servers (unlike Snowflake's machine ID requirement).
- UUID-compatible: can be stored in UUID columns without schema changes.
- URL-safe encoding (Crockford Base32 avoids ambiguous characters).
Cons
- 128 bits, so larger than Snowflake's 64 bits.
- Not as widely adopted as UUID (fewer native database/language support).
- Random component is not deterministic, so two ULIDs generated in the same millisecond on the same machine are not strictly ordered (though monotonic implementations exist).
Approach 5: Ticket Servers (Flickr Approach)
Flickr's approach uses dedicated "ticket server" databases whose sole purpose is to generate sequential IDs using REPLACE INTO and LAST_INSERT_ID().
-- Ticket server schema
CREATE TABLE tickets (
id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
stub CHAR(1) NOT NULL DEFAULT '',
PRIMARY KEY (id),
UNIQUE KEY stub (stub)
) ENGINE=InnoDB;
-- Generate next ID
REPLACE INTO tickets (stub) VALUES ('a');
SELECT LAST_INSERT_ID();
-- Use two ticket servers with odd/even IDs for HA:
-- Server 1: auto_increment_offset=1, auto_increment_increment=2
-- Server 2: auto_increment_offset=2, auto_increment_increment=2
Pros
- Simple to implement and understand.
- Produces numeric, sortable IDs.
- Two servers provide high availability.
Cons
- Network round trip for every ID (adds latency).
- Ticket servers become a dependency and potential bottleneck.
- Requires careful management of the odd/even server pairing.
Comparison Table
| Approach | Size | Sortable | Coordination | Throughput | Best For |
|---|---|---|---|---|---|
| UUID v4 | 128 bits | No | None | Unlimited | Simple systems, cross-system interop |
| Auto-Increment | 32/64 bits | Yes | Single DB | DB limited | Single-server apps, prototypes |
| Snowflake | 64 bits | Yes (time) | Machine ID | 4M/s/machine | High-scale distributed systems |
| ULID | 128 bits | Yes (lexicographic) | None | Unlimited | When you need sortable UUIDs |
| Ticket Server | 64 bits | Yes | Ticket DB | DB limited | Medium-scale, simple infra |
When to Use What
- Building a small to medium web app? Database auto-increment is fine. Do not over-engineer.
- Need cross-service unique IDs with no coordination? UUID v4 or ULID. Choose ULID if you need sortability.
- Building a high-throughput distributed system (social network, messaging)? Snowflake or a Snowflake-like approach. The 64-bit size and time-sortability make it ideal.
- Migrating from UUIDs but need sortability? ULID is a drop-in replacement (same 128-bit size).
- System design interview? Start with requirements, then walk through Snowflake. It is the most commonly asked approach.
Interview Tip
When asked about unique ID generation, start by clarifying requirements: "Do we need the IDs to be sortable? How many IDs per second? Do we need them to be compact (64 bits) or is 128 bits acceptable? Do we have multiple data centers?" These questions demonstrate that you understand the trade-offs and are not just memorizing solutions.
Summary
Unique ID generation is a foundational problem in distributed systems. The right approach depends on your scale, sortability needs, and tolerance for coordination overhead. For most large-scale systems, a Snowflake-style generator (64-bit, time-sorted, machine-local) is the gold standard. For simpler needs or UUID compatibility, ULID provides an excellent balance of sortability and simplicity.