Distributed Transactions

ACID Revisited

The safety guarantees provided by transactions are often described by the acronym ACID: Atomicity, Consistency, Isolation, and Durability. But these terms are surprisingly imprecise in practice. Let us clarify what each one actually means.

ACID: What Each Letter Really Means

Atomicity: Not about concurrency (that is isolation). Atomicity means that if a transaction makes several writes and one of them fails (or the process crashes midway), the transaction is aborted and the database discards all writes from that transaction. The application can safely retry because it knows no partial results were saved. Think "abortability" rather than "atomicity."
Consistency: This is actually a property of the application, not the database. It means that certain invariants (like "credits and debits must balance") are always true. The database can help enforce some invariants (foreign keys, unique constraints), but ultimately the application must define what "consistent" means and write transactions that maintain those invariants.
Isolation: Concurrently executing transactions do not interfere with each other. The classic textbook definition says each transaction can pretend it is the only one running (serializability), but in practice, most databases use weaker isolation levels for performance.
Durability: Once a transaction has committed, the data will not be lost even if the hardware fails or the database crashes. In a single-node database, this typically means writing to a write-ahead log (WAL) and/or flushing to disk. In a replicated database, durability may mean the data has been replicated to some number of nodes.

Isolation Levels

Concurrency bugs are hard to test for and hard to reason about. Isolation levels define what kinds of anomalies can occur when multiple transactions run concurrently. Stronger isolation prevents more anomalies but comes at a higher performance cost.

Read Committed

The most basic isolation level that makes two guarantees:

No dirty reads: You will only see data that has been committed. If a transaction has written data but not yet committed, other transactions cannot see those writes.
No dirty writes: You will only overwrite data that has been committed. If two transactions concurrently try to update the same object, the second write blocks until the first transaction commits or aborts.

Read committed is the default in PostgreSQL, Oracle, and SQL Server. It is implemented by holding row-level locks for writes (released on commit) and using the old committed value for reads (so readers never block writers).

However, read committed does not prevent read skew (non-repeatable reads). If you read two related pieces of data (e.g., two account balances for a transfer) at different points in time, you might see an inconsistent state where money appears to have vanished temporarily.

Snapshot Isolation (Repeatable Read / MVCC)

Snapshot isolation gives each transaction a consistent view of the entire database as it was at the start of the transaction. Even if other transactions commit changes during your transaction, you always see the snapshot from the beginning. This prevents read skew.

The implementation uses Multi-Version Concurrency Control (MVCC): the database keeps multiple versions of each object, tagged with the transaction ID that created them. When a transaction reads an object, the database returns the version that was committed at the time the reading transaction started, ignoring any later writes. This means readers never block writers and writers never block readers.

Snapshot isolation is called "repeatable read" in PostgreSQL and MySQL/InnoDB. It is also the default isolation level in MySQL/InnoDB.

MVCC in Practice

When a transaction writes to a row, the database does not overwrite the old version in place. Instead, it creates a new version of the row with the transaction's ID. Both versions coexist. When the transaction commits, newer transactions see the new version; ongoing transactions that started before the commit continue to see the old version. Old versions are eventually garbage-collected when no transaction can possibly need them anymore.

Serializability

The strongest isolation level. It guarantees that even though transactions may execute concurrently, the result is the same as if they had executed serially (one after another). This prevents all concurrency anomalies, including write skew and phantom reads. Three main implementation strategies exist:

Actual serial execution: Execute transactions one at a time on a single thread. This eliminates concurrency entirely. Viable when transactions are fast (stored procedures, in-memory data). Used by VoltDB, Redis (single-threaded), and Datomic.
Two-Phase Locking (2PL): The traditional approach (used for ~30 years). Explained in detail below.
Serializable Snapshot Isolation (SSI): An optimistic approach built on snapshot isolation. Explained in detail below.

Isolation Level	Dirty Reads	Dirty Writes	Read Skew	Lost Updates	Write Skew	Phantoms
Read Uncommitted	Possible	Possible	Possible	Possible	Possible	Possible
Read Committed	Prevented	Prevented	Possible	Possible	Possible	Possible
Snapshot Isolation	Prevented	Prevented	Prevented	Prevented*	Possible	Possible
Serializable	Prevented	Prevented	Prevented	Prevented	Prevented	Prevented

* Snapshot isolation prevents lost updates in most implementations via first-committer-wins or compare-and-set, but this is implementation-dependent.

Write Skew and Phantom Reads

Write skew is a subtle concurrency anomaly that snapshot isolation does not prevent. It occurs when two transactions read the same data, make decisions based on what they read, and then write to different objects. Neither transaction's write conflicts with the other at the row level, but the combination of writes violates an application invariant.

Write Skew Example: On-Call Doctor Scheduling-- Invariant: At least one doctor must be on call at all times.
-- Currently, both Alice and Bob are on call.

-- Transaction 1 (Alice wants to go off call):
BEGIN;
SELECT count(*) FROM doctors WHERE on_call = true AND shift_id = 1234;
-- Result: 2 (Alice and Bob are both on call)
-- Application logic: 2 >= 2, so it's safe for one to leave
UPDATE doctors SET on_call = false WHERE name = 'Alice' AND shift_id = 1234;
COMMIT;

-- Transaction 2 (Bob wants to go off call, concurrently):
BEGIN;
SELECT count(*) FROM doctors WHERE on_call = true AND shift_id = 1234;
-- Result: 2 (snapshot still shows both on call)
-- Application logic: 2 >= 2, so it's safe for one to leave
UPDATE doctors SET on_call = false WHERE name = 'Bob' AND shift_id = 1234;
COMMIT;

-- Result: BOTH Alice and Bob are now off call. Invariant violated!
-- Neither transaction wrote to the same row, so no write-write conflict
-- was detected. This is write skew.

Other examples of write skew include: double-booking a meeting room, claiming a unique username (two users register the same username simultaneously), and preventing double-spending.

Phantom Reads

A phantom read occurs when a transaction reads a set of rows matching a condition, and another transaction inserts or deletes a row that matches that condition. If the first transaction re-reads, it sees a different set of rows. This is related to write skew: in the on-call doctor example, the SELECT query checked a condition, but there was no single row to lock to prevent the concurrent transaction. The problematic pattern is:

A SELECT query checks some condition.
Based on the result, the application decides to proceed.
It makes a write (INSERT, UPDATE, DELETE) that changes the result of the original condition check.

If a row does not exist yet (INSERT case), there is nothing to lock. This is why preventing phantom reads requires special techniques like predicate locks or index-range locks (also called next-key locks in MySQL).

Two-Phase Locking (2PL)

Two-phase locking is the oldest and most widely used algorithm for serializable isolation. The rule is simple:

If transaction A has read an object and transaction B wants to write to that object, B must wait until A commits or aborts (and vice versa).
In 2PL, locks are acquired during the transaction (phase 1: growing) and released only when the transaction commits or aborts (phase 2: shrinking).

Crucially, readers and writers block each other (unlike snapshot isolation, where they never block). This means 2PL has significantly worse performance and is prone to deadlocks. If the database detects a deadlock (two transactions waiting for each other), it aborts one of them to break the cycle.

To handle phantoms, 2PL implementations use predicate locks (lock all rows matching a condition, including rows that do not yet exist) or, more commonly, index-range locks (a practical approximation that locks a larger range than strictly necessary but is much cheaper to implement).

Performance of 2PL

2PL has a well-deserved reputation for poor performance:

The overhead of acquiring and releasing locks is significant.
Reduced concurrency: a transaction holding a lock on frequently-accessed data forces all other transactions to wait.
Deadlock detection adds additional overhead.
A single slow transaction can cascade, blocking all other transactions and causing the entire system to grind to a halt.

For these reasons, many databases default to weaker isolation levels even though they know it allows concurrency bugs.

Serializable Snapshot Isolation (SSI)

SSI is a relatively new algorithm (described in a 2008 paper) that provides full serializability without the performance penalty of 2PL. It is used by PostgreSQL (since 9.1, as the "serializable" level) and CockroachDB.

SSI is optimistic: transactions proceed without blocking, just like snapshot isolation. At commit time, the database checks whether any concurrent transaction has created a conflict that would violate serializability. If so, one of the conflicting transactions is aborted and must be retried.

Specifically, SSI detects two situations:

Stale reads (MVCC): The transaction read a value that was subsequently overwritten by a committed transaction. The database tracks which objects were read and checks at commit time.
Write affecting prior reads: The transaction is about to write to an object that another concurrent transaction has already read. The database uses index-range locks (or similar structures) to detect these situations.

Compared to 2PL, SSI has much better performance because reads and writes do not block each other. The tradeoff is that some transactions will be aborted at commit time if conflicts are detected, so the application must be prepared to retry. In workloads with low contention (most transactions touch different data), very few transactions need to be aborted, and SSI performs close to snapshot isolation.

So far we have discussed transactions on a single database. Distributed transactions involve multiple nodes, databases, or services. They are significantly harder to implement correctly and come with serious performance trade-offs.

Two-Phase Commit (2PC)

Not to be confused with two-phase locking (2PL), two-phase commit is a protocol for achieving atomic commit across multiple nodes. A transaction coordinator manages the process:

Phase 1 (Prepare): The coordinator asks each participant "can you commit?" Each participant writes the transaction data to durable storage and responds yes or no.
Phase 2 (Commit/Abort): If all participants said yes, the coordinator writes a commit decision to its log and tells all participants to commit. If any participant said no, the coordinator tells all to abort.

The critical point: once a participant has voted "yes" in phase 1, it must wait for the coordinator's decision. If the coordinator crashes after phase 1, the participant is stuck - it has said "yes" but does not know whether to commit or abort. It cannot release its locks until it hears from the coordinator. This is why 2PC can block indefinitely if the coordinator fails, earning it the label "blocking protocol."

The Coordinator Is a Single Point of Failure

If the transaction coordinator crashes after sending "prepare" but before sending the final "commit" or "abort," all participants that voted "yes" are stuck in an in-doubt state. They hold locks on the data, blocking other transactions, and cannot proceed until the coordinator recovers. In practice, this means database administrators sometimes have to manually resolve in-doubt transactions, a process that risks data inconsistency if done incorrectly. This is one of the main reasons distributed transactions are used sparingly in production.

Three-Phase Commit (3PC)

3PC was designed as a non-blocking alternative to 2PC. It adds a "pre-commit" phase between prepare and commit. In theory, 3PC ensures that participants can reach a decision even if the coordinator crashes. However, 3PC assumes a network with bounded delay and nodes that respond within a bounded time - assumptions that do not hold in real networks. For this reason, 3PC is rarely used in practice.

The Saga Pattern

An alternative to distributed transactions for microservices architectures. A saga is a sequence of local transactions where each step has a corresponding compensating action (an undo operation). If step 3 fails, the system runs compensating actions for steps 2 and 1 to undo their effects.

Choreography: Each service listens for events and decides what to do next. Service A completes its step and publishes an event; Service B hears the event and performs its step. Simple but hard to track the overall progress.
Orchestration: A central saga orchestrator tells each service what to do and tracks the overall state. Easier to understand and debug but introduces a central coordinator.

Sagas provide eventual consistency, not ACID guarantees. The application must handle the temporary inconsistency between steps and design compensating actions carefully (they must be idempotent, since they may be retried).

Exactly-Once Processing and Idempotency

In distributed systems, network failures mean that a message may be delivered more than once (the sender does not know if the receiver processed it, so it retries). Achieving exactly-once processing typically relies on idempotency:

An operation is idempotent if performing it multiple times has the same effect as performing it once. For example, "set x = 5" is idempotent but "increment x by 1" is not.
To make a non-idempotent operation safe against retries, assign each request a unique idempotency key. The server stores the key and result of processed requests. On a retry, it returns the stored result instead of re-executing the operation.
Stripe, for example, accepts an Idempotency-Key header on payment API calls. If you retry the same call with the same key, it returns the result of the original call rather than charging the customer twice.

Idempotency Key Pattern// Client generates a unique key for each logical operation
const idempotencyKey = crypto.randomUUID();

// First attempt
POST /api/payments
Headers: { "Idempotency-Key": "550e8400-e29b-41d4-a716-446655440000" }
Body: { "amount": 1000, "currency": "USD", "to": "merchant_42" }
// Response: 201 Created, payment processed

// Network timeout - client does not know if it succeeded, so it retries
POST /api/payments
Headers: { "Idempotency-Key": "550e8400-e29b-41d4-a716-446655440000" }
Body: { "amount": 1000, "currency": "USD", "to": "merchant_42" }
// Response: 200 OK (returns cached result, payment NOT processed again)

Choosing the Right Isolation Level

The right choice depends on your application's requirements:

Read committed: Acceptable for many OLTP workloads where occasional read skew is tolerable. The default in PostgreSQL.
Snapshot isolation: Good for read-heavy workloads, long-running queries (analytics), and applications that need consistent reads. Prevents most anomalies except write skew. Default in MySQL/InnoDB.
Serializable (2PL): When correctness is paramount and you can tolerate lower throughput and higher latency. Used in traditional banking systems.
Serializable (SSI): When you want serializability with better performance than 2PL and can tolerate transaction retries. Used by CockroachDB and PostgreSQL's serializable level.
No transactions (eventual consistency): For high-throughput systems where the application can tolerate temporary inconsistency. Common in NoSQL databases and microservices architectures using sagas.

Key Takeaways

ACID is less precise than it sounds. Atomicity is about abortability, Consistency is an application property, Isolation has many levels, and Durability means different things for single-node vs. replicated databases.
Most databases do not actually give you serializable isolation by default. Read committed and snapshot isolation are the most common defaults, and both allow certain concurrency anomalies.
Write skew is the trickiest anomaly: two transactions read the same data and write to different objects, violating an invariant that neither write alone would violate. Only serializable isolation prevents it.
SSI (serializable snapshot isolation) offers the best of both worlds: full serializability with performance close to snapshot isolation, at the cost of potential transaction aborts.
Distributed transactions (2PC) are correct but fragile: a coordinator crash can block participants indefinitely. Sagas are the practical alternative for microservices, trading ACID for eventual consistency.
Idempotency keys are the standard mechanism for achieving exactly-once semantics in distributed systems where retries are inevitable.

Transactions & Isolation Levels