The Simplest Database: An Append-Only Log
The simplest possible storage engine is an append-only file. Every write appends a key-value pair to the end of the file. Reads scan the file backwards to find the most recent value for a given key.
```python
# Write: O(1) -- just append to the end of the log file
def put(key, value):
    with open("database.log", "a") as f:
        f.write(f"{key},{value}\n")

# Read: O(n) -- scan from the end to find the most recent value
def get(key):
    with open("database.log") as f:
        for line in reversed(f.readlines()):
            k, v = line.rstrip("\n").split(",", 1)
            if k == key:
                return v
    return None
```

Writes are extremely fast (sequential I/O), but reads are O(n) because we must scan the entire log. To make reads efficient, we need an index: a separate data structure that helps locate data without scanning everything. The fundamental trade-off of storage engines is that indexes speed up reads but slow down writes, because every write must also update the index.
Log-Structured Storage
Log-structured storage engines build on the append-only idea, adding clever data structures to make both reads and writes efficient. This family includes Bitcask, SSTables, LSM trees, and the engines behind Cassandra, HBase, RocksDB, and LevelDB.
Hash Index
The simplest improvement: maintain an in-memory hash map that maps each key to its byte offset in the log file. Writes append to the log and update the hash map. Reads look up the offset in the hash map and seek directly to that position on disk.
- Pros: O(1) reads and writes. Simple to implement.
- Cons: The entire key set must fit in memory. Range queries are not supported (you cannot efficiently find all keys between "user_100" and "user_200").
This is the approach used by Bitcask (the default storage engine in Riak). It works well when the number of distinct keys is manageable but each key is updated frequently.
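The hash-index approach can be sketched in a few lines of Python. This is a toy illustration, not Bitcask's actual implementation; the class name and the one-`key,value`-record-per-line file format are invented for the example:

```python
import os

class HashIndexedLog:
    """Append-only log with an in-memory hash index (Bitcask-style sketch)."""

    def __init__(self, path):
        self.path = path
        self.index = {}            # key -> byte offset of its latest record
        open(path, "ab").close()   # ensure the log file exists

    def put(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.tell()      # append position = offset of this record
            f.write(record)
        self.index[key] = offset   # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)         # jump straight to the record: one seek, no scan
            line = f.readline().decode().rstrip("\n")
        _, value = line.split(",", 1)
        return value
```

A real engine must also rebuild the index by scanning the log on restart, and split the log into segments so stale records can be compacted away.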
SSTables (Sorted String Tables)
An SSTable is a file where key-value pairs are sorted by key. This simple constraint enables powerful optimizations:
- Efficient merging: Merging multiple SSTables is like merge-sort -- read from each file in order, producing a new sorted file. This is the basis of compaction.
- Sparse indexing: You only need an in-memory index for some keys. To find a key between two indexed entries, scan the short segment between them.
- Block compression: Because keys are sorted, nearby keys can be compressed together, reducing disk I/O and space usage.
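The merge step can be sketched with Python's `heapq.merge`, which streams sorted inputs without loading them fully into memory -- the same property that lets compaction handle SSTables larger than RAM. The function below is an illustrative stand-in: runs are in-memory lists of sorted (key, value) pairs, and later arguments are treated as newer:

```python
import heapq

def merge_sstables(*runs):
    """Merge sorted runs of (key, value) pairs; later (newer) runs win on ties."""
    out = []
    # heapq.merge is stable: for equal keys, pairs from earlier (older) runs
    # arrive first, so the last pair seen for a key is the newest value.
    for key, value in heapq.merge(*runs, key=lambda kv: kv[0]):
        if out and out[-1][0] == key:
            out[-1] = (key, value)   # newer run: overwrite the older value
        else:
            out.append((key, value))
    return out
```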
LSM Trees (Log-Structured Merge Trees)
LSM trees are the dominant log-structured storage engine, used by RocksDB, LevelDB, Cassandra, ScyllaDB, and many more. The architecture works as follows:
Step 1: Write to Memtable
Incoming writes go to an in-memory balanced tree (red-black tree or skip list) called the memtable. This keeps writes fast and maintains sorted order in memory. A write-ahead log (WAL) on disk ensures durability in case of crashes.
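A minimal sketch of this step, using a sorted Python list kept in order with `bisect` as a stand-in for the balanced tree, and a plain text file as the WAL (the class name and record format are invented for the example):

```python
import bisect

class Memtable:
    """Sorted in-memory write buffer backed by a write-ahead log (toy sketch)."""

    def __init__(self, wal_path):
        self.keys, self.values = [], []
        self.wal = open(wal_path, "a", encoding="utf-8")

    def put(self, key, value):
        self.wal.write(f"{key},{value}\n")
        self.wal.flush()                      # durable before acknowledging
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value            # key exists: overwrite in place
        else:
            self.keys.insert(i, key)          # keep keys in sorted order
            self.values.insert(i, value)

    def items(self):
        # Already sorted, so flushing to an SSTable is a straight write-out.
        return list(zip(self.keys, self.values))
```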
Step 2: Flush to SSTable
When the memtable exceeds a size threshold (typically 64-256 MB), it is flushed to disk as a new SSTable file. Because the memtable was sorted, the resulting SSTable is automatically sorted. The WAL for the flushed memtable is discarded.
Step 3: Compaction
Over time, many SSTables accumulate. A background compaction process merges them, discarding deleted keys (tombstones) and superseded values. There are two main compaction strategies: size-tiered (group similarly-sized SSTables) and leveled (organize SSTables into levels with non-overlapping key ranges).
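A toy compaction pass, assuming deletes are recorded as a `None` tombstone value (real engines use an explicit marker): newer runs win on duplicate keys, and tombstoned keys are dropped from the output entirely:

```python
TOMBSTONE = None  # stand-in deletion marker, an assumption for this sketch

def compact(runs):
    """Merge runs (oldest first, newest last) and drop deleted keys."""
    latest = {}
    for run in runs:
        for key, value in run:
            latest[key] = value   # later runs are newer, so they overwrite
    # Tombstones and superseded values vanish from the compacted output.
    return sorted((k, v) for k, v in latest.items() if v is not TOMBSTONE)
```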
Step 4: Read Path
To read a key: first check the memtable, then check SSTables from newest to oldest. A Bloom filter for each SSTable quickly tells you whether a key might be present, avoiding unnecessary disk reads for keys that do not exist.
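A Bloom filter itself can be sketched in a few lines. This toy version hashes each key several ways with SHA-256 and sets bits in a single integer; the size and hash count are arbitrary choices, not tuned values:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative parameters, not tuned)."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = 0  # big integer used as a bit array

    def _positions(self, key):
        # Derive several independent-ish bit positions per key.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

A `False` answer is definitive, so the SSTable can be skipped entirely; a `True` answer only means the key might be present, so a small fraction of lookups still read disk unnecessarily.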
```
Client Write("user_42", "{name: Alice}")
        |
        v
Write-Ahead Log (WAL) -- sequential append for durability
        |
        v
Memtable (in-memory sorted tree)
        | (when size threshold reached)
        v
Flush to SSTable on disk (sorted, immutable)
        | (background)
        v
Compaction: merge SSTables, remove tombstones
```

Page-Oriented Storage: B-Trees
B-trees are the most widely used index structure, powering PostgreSQL, MySQL (InnoDB), Oracle, and SQL Server. Unlike log-structured engines which write sequentially, B-trees maintain a sorted data structure that is updated in place.
How B-Trees Work
A B-tree organizes data into fixed-size pages (typically 4-16 KB), arranged in a tree hierarchy. Each page contains sorted keys and pointers to child pages. The tree has a branching factor (number of child pointers per page), typically in the hundreds.
```
                 [100 | 200 | 300]                      -- root page
          /          |         |         \
  [10|50|80]   [120|150]  [220|250]  [320|350|400]      -- internal pages
   / | | \      /  |  \    /  |  \     / |  |  \
  ...  ...     ...  ...   ...  ...    ...  ...  ...     -- leaf pages

Branching factor: 4 (simplified; real B-trees have 100-500+)

Depth for 1 billion keys with branching factor 500:
  500^4 = 62.5 billion > 1 billion
  Only 4 levels deep! (root + 3 levels)
```

A four-level B-tree with a branching factor of 500 can store up to 62.5 billion keys. Since the root page is usually cached in memory, looking up any key requires at most 3 disk reads.
B-Tree Writes
To write a value, the B-tree finds the leaf page where the key belongs and updates it in place. If the page is full, it is split into two pages, and the parent page is updated to point to both. Splits can cascade up the tree but are infrequent.
In-place updates mean the database overwrites existing pages on disk. This is fundamentally different from log-structured engines, which never modify existing data. To protect against crashes during a write (which could leave a page half-written), B-trees use a write-ahead log (WAL) -- the same durability mechanism as LSM trees but for a different reason.
B+ Trees
Most databases actually use B+ trees, a variant where all values are stored in leaf pages, and internal pages contain only keys and child pointers. Leaf pages are linked together in a doubly-linked list, enabling efficient range scans by following pointers from one leaf to the next without traversing back up the tree.
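The leaf-chain range scan can be sketched as follows. For simplicity this toy version starts from the first leaf rather than descending the tree to the leaf containing the lower bound, as a real B+ tree would; the class and function names are invented for the example:

```python
class Leaf:
    """B+ tree leaf sketch: sorted (key, value) pairs plus a sibling pointer."""

    def __init__(self, items, next_leaf=None):
        self.items = items        # sorted list of (key, value) pairs
        self.next = next_leaf     # next leaf in key order

def range_scan(leaf, lo, hi):
    """Collect all keys in [lo, hi] by walking the leaf chain."""
    results = []
    while leaf is not None:
        for key, value in leaf.items:
            if key > hi:
                return results    # past the upper bound: stop early
            if key >= lo:
                results.append((key, value))
        leaf = leaf.next          # follow the sibling pointer, no re-descent
    return results
```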
LSM Trees vs B-Trees
LSM Trees
- Writes are sequential (append-only), which is dramatically faster than random writes on HDDs and still measurably faster on SSDs.
- Higher write throughput, especially for write-heavy workloads.
- Better compression: sorted SSTables compress well and have no page fragmentation.
- Compaction can cause latency spikes and compete for disk bandwidth.
- A key may exist in multiple SSTables, making reads slower (mitigated by Bloom filters).
- Write amplification from compaction (data rewritten multiple times).
B-Trees
- Predictable read performance: each key exists in exactly one place.
- Strong transaction support: easy to attach locks to ranges of keys.
- More predictable latency (no compaction storms).
- Write amplification from WAL + page writes (data written at least twice).
- Page fragmentation can waste space over time.
- In-place updates require careful concurrency control (latches/locks).
| Property | LSM Trees | B-Trees |
|---|---|---|
| Write pattern | Sequential (append) | Random (in-place update) |
| Write throughput | Higher | Lower |
| Read latency | Higher (multiple SSTables) | Lower (single location) |
| Space efficiency | Better (compaction removes waste) | Worse (page fragmentation) |
| Write amplification | Compaction rewrites | WAL + page writes |
| Compaction overhead | Yes (can cause latency spikes) | No |
| Range queries | Good (sorted SSTables) | Excellent (B+ tree leaf chain) |
| Concurrency | Simpler (immutable files) | Requires page-level latches |
| Used by | RocksDB, Cassandra, LevelDB | PostgreSQL, MySQL, Oracle |
Column-Oriented Storage
Row-oriented storage engines (both LSM and B-tree) store all columns of a row together. This is efficient for transactional workloads that read and write entire rows. But analytical queries often need only a few columns from millions of rows.
How Columnar Storage Works
Column-oriented storage stores each column's values together in a separate file. A query that needs only 3 columns out of 100 reads only 3 files instead of scanning all 100 columns for every row.
```
Row-oriented (PostgreSQL, MySQL):
  Row 1: [user_id=1, name="Alice", city="NYC", age=30, ...]
  Row 2: [user_id=2, name="Bob",   city="LA",  age=25, ...]
  Row 3: [user_id=3, name="Carol", city="NYC", age=35, ...]

Column-oriented (Redshift, ClickHouse, Parquet):
  user_id column: [1, 2, 3, ...]
  name column:    ["Alice", "Bob", "Carol", ...]
  city column:    ["NYC", "LA", "NYC", ...]
  age column:     [30, 25, 35, ...]

Query: SELECT city, COUNT(*) FROM users GROUP BY city
  Row store:    reads ALL columns for every row (wasteful)
  Column store: reads ONLY the city column (efficient)
```

Columnar Compression
Because a column contains values of the same type and often with many repetitions, columnar data compresses extremely well. Common techniques include:
- Bitmap encoding: For columns with few distinct values (e.g., country), store a bitmap for each distinct value indicating which rows have that value.
- Run-length encoding (RLE): If the same value repeats many times (after sorting), store the value and the count instead of repeating it.
- Dictionary encoding: Replace string values with integer codes from a dictionary, then store the integers (which compress better).
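Run-length and dictionary encoding are simple enough to sketch directly. These are toy versions; real columnar formats such as Parquet combine them with bit-packing and general-purpose compression:

```python
from itertools import groupby

def run_length_encode(column):
    """RLE: collapse consecutive runs of equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]

def dictionary_encode(column):
    """Replace each distinct value with a small integer code."""
    codes = {}      # value -> integer code
    encoded = []
    for value in column:
        code = codes.setdefault(value, len(codes))
        encoded.append(code)
    return encoded, codes
```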
Materialized Views and Data Cubes
Analytical databases often precompute aggregations and store them as materialized views. A data cube (OLAP cube) precomputes aggregations along multiple dimensions (e.g., total revenue by product, region, and month), enabling instant lookups at the cost of storage space and write overhead.
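A data cube over one combination of dimensions is essentially a precomputed GROUP BY. The sketch below materializes sums keyed by (region, month) tuples; a full OLAP cube would precompute aggregates for every subset of the dimensions as well. All names here are illustrative:

```python
from collections import defaultdict

def build_cube(rows, dimensions, measure):
    """Precompute sums of `measure` grouped by the given dimension columns."""
    cube = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)   # e.g. ("EU", "Jan")
        cube[key] += row[measure]
    return dict(cube)   # lookups are now O(1) instead of a full scan
```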
In-Memory Databases
As RAM prices have dropped, databases that keep their entire dataset in memory have become practical for datasets up to several terabytes on modern servers.
Redis Internals
Redis stores all data in memory using optimized data structures: hash tables for key-value lookups, skip lists for sorted sets, and compact encodings for small collections (ziplists, intsets).
```
Strings:    Simple key-value (GET/SET)
Hash:       Field-value pairs within a key (HGET/HSET)
List:       Doubly-linked list (LPUSH/RPOP) -- message queues
Set:        Unordered unique elements (SADD/SMEMBERS)
Sorted Set: Elements scored by a float (ZADD/ZRANGE) -- leaderboards
Stream:     Append-only log with consumer groups -- event streaming

Durability strategies:
  RDB snapshots: periodic full dump to disk (fork + copy-on-write)
  AOF (Append-Only File): log every write command, replay on restart
  Hybrid: RDB for fast restart + AOF for recent writes
```

Why In-Memory is Faster
Counterintuitively, the speed advantage of in-memory databases is not primarily about avoiding disk I/O (the OS page cache already keeps frequently accessed data in memory). The real advantage is avoiding the overhead of encoding data into a disk-friendly format. In-memory databases can use data structures like skip lists, hash tables, and tries that would be impractical on disk.
Indexing Strategies
An index is a derived data structure that enables efficient lookups by specific fields. Every index speeds up reads but slows down writes (because the index must be updated on every insert, update, and delete).
Primary vs Secondary Indexes
- Primary index: The main index, usually on the primary key. Determines how data is physically stored (clustered index in InnoDB).
- Secondary index: Additional indexes on non-primary-key columns. Each entry stores a pointer to the row (heap file) or a copy of the primary key (index-organized tables).
Composite Indexes
A composite (multi-column) index sorts data by the first column, then by the second column within ties, and so on. The order of columns matters enormously.
```sql
CREATE INDEX idx_city_age ON users(city, age);

-- This index helps:
SELECT * FROM users WHERE city = 'NYC' AND age > 25;  -- uses both columns
SELECT * FROM users WHERE city = 'NYC';               -- uses first column

-- This index does NOT help:
SELECT * FROM users WHERE age > 25;  -- cannot skip the first column

-- Rule: a composite index (A, B, C) supports queries on:
--   (A), (A, B), (A, B, C)
-- but NOT (B), (C), or (B, C) alone
```

Covering Indexes
A covering index includes all the columns needed by a query, so the database can answer the query entirely from the index without looking up the actual row. This eliminates the extra disk seek to the heap file.
```sql
-- Query: find names of users in NYC
SELECT name FROM users WHERE city = 'NYC';

-- Regular index on (city): must look up each matching row to get name
-- Covering index on (city) INCLUDE (name): returns name directly from index
CREATE INDEX idx_city_covering ON users(city) INCLUDE (name);

-- Trade-off: larger index size, slower writes, but faster reads for this query
```

Full-Text and Fuzzy Indexes
Standard B-tree and LSM indexes only support exact-match and range queries. For full-text search, databases use inverted indexes (as in Elasticsearch and Lucene): a mapping from each word to the list of documents containing it. Fuzzy search uses techniques like edit-distance automata and trigram indexes to find approximate matches.
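An inverted index is straightforward to sketch. This toy version tokenizes by lowercasing and splitting on whitespace; real engines add stemming, stop-word removal, and relevance ranking:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():   # naive tokenizer
            index[word].add(doc_id)
    return index

def search(index, *words):
    """AND-search: ids of documents containing every query word."""
    sets = [index.get(w.lower(), set()) for w in words]
    return set.intersection(*sets) if sets else set()
```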
Summary
The storage engine is the foundation that determines your database's performance characteristics. LSM trees optimize for write throughput with sequential I/O and compaction, while B-trees optimize for read performance with predictable latency. Column-oriented storage transforms analytical query performance by reading only the columns needed. Understanding these internals helps you choose the right database for your workload and tune it effectively.
Key Takeaways
- Log-structured engines (LSM trees) convert random writes into sequential writes, achieving higher write throughput at the cost of compaction overhead and potentially slower reads.
- B-trees provide predictable read performance and strong transactional support, making them the default for most OLTP databases.
- Column-oriented storage dramatically accelerates analytical queries by reading only the needed columns and enabling aggressive compression.
- In-memory databases are faster primarily because they can use memory-optimized data structures, not just because they avoid disk I/O.
- Every index is a trade-off: faster reads at the cost of slower writes and more storage. Choose indexes based on your actual query patterns.
- Composite indexes follow the leftmost prefix rule: an index on (A, B, C) helps queries filtering on A, or A+B, or A+B+C, but not B or C alone.