Databases in System Design | System Design Course

Relational Databases (SQL)

Relational databases store data in structured tables with rows and columns. Relationships between tables are expressed through foreign keys. They use SQL (Structured Query Language) for querying.

Examples: PostgreSQL, MySQL, Oracle, SQL Server, SQLite.

ACID Properties

Relational databases guarantee ACID transactions:

AtomicityA transaction either completes entirely or not at all. If any part fails, the entire transaction is rolled back.
ConsistencyA transaction moves the database from one valid state to another. All constraints (unique, foreign key, check) are enforced.
IsolationConcurrent transactions behave as if they were executed sequentially. Different isolation levels trade strictness for performance.
DurabilityOnce a transaction is committed, the data is permanently written, even if the system crashes immediately after.

Indexing

An index is a data structure (usually a B-tree or B+ tree) that speeds up read queries by avoiding full table scans. Without an index, the database must examine every row to find matches. With an index, it can jump directly to the relevant rows.

Trade-off

Indexes speed up reads but slow down writes. Every INSERT, UPDATE, or DELETE must also update the index. Over-indexing a table that handles heavy writes will degrade performance. Index only columns that appear frequently in WHERE clauses, JOIN conditions, or ORDER BY clauses.

Normalization

Normalization organizes data to reduce redundancy and improve integrity:

1NF: Each column holds atomic values; no repeating groups.
2NF: 1NF + every non-key column depends on the entire primary key.
3NF: 2NF + no transitive dependencies (non-key columns depend only on the primary key).

Normalization reduces data duplication but can increase the number of JOINs needed for queries. In high-performance systems, deliberate denormalization, duplicating data across tables, is common to avoid expensive joins.

Non-Relational Databases (NoSQL)

NoSQL databases sacrifice some relational guarantees for flexibility, scalability, or performance. There are four primary categories:

NoSQL Database Categories

Key-Value Stores

The simplest NoSQL model. Data is stored as key-value pairs. Keys are unique identifiers; values can be anything (strings, JSON, binary). Operations are limited: GET, SET, DELETE. Extremely fast for simple lookups.

Use cases: Caching, session storage, feature flags, rate limiting counters.

Document Stores

Store data as documents (typically JSON or BSON). Each document can have a different structure. Documents are grouped into collections. Support rich queries within documents.

Use cases: Content management systems, user profiles, product catalogs, any domain where schema flexibility matters.

Column-Family Stores

Data is organized by columns rather than rows. Each row can have a different set of columns, and columns are grouped into families. Optimized for writing large volumes of data and for reading specific columns across many rows.

Use cases: Time-series data, IoT event logs, analytics, write-heavy workloads.

Graph Databases

Store data as nodes and edges (relationships). Queries traverse the graph efficiently, making them ideal for connected data. Finding the shortest path or all friends of friends is natural and fast.

Use cases: Social networks, fraud detection, recommendation engines, knowledge graphs.

SQL vs. NoSQL: When to Choose Which

Choose SQL When

Your data has clear relationships and a stable schema.
You need ACID transactions.
You need complex queries with JOINs, aggregations, subqueries.
Data integrity is critical (financial, medical).
Your read and write patterns are balanced.

Choose NoSQL When

Your schema evolves frequently or varies per record.
You need horizontal scalability above all else.
Your access patterns are simple (key lookups, document reads).
You handle massive write volumes.
Eventual consistency is acceptable for your use case.

In Practice

Most large systems use both SQL and NoSQL. For example, a user profile might live in PostgreSQL (relational, ACID), while the user's activity feed is stored in Cassandra (write-heavy, time-series), and session tokens are cached in Redis (key-value, fast). This is polyglot persistence.

Transactions and Isolation Levels

When multiple transactions run concurrently, isolation levels control how much they see of each other's uncommitted work:

Isolation Level	Dirty Reads	Non-Repeatable Reads	Phantom Reads	Performance
Read Uncommitted	Possible	Possible	Possible	Fastest
Read Committed	Prevented	Possible	Possible	Fast
Repeatable Read	Prevented	Prevented	Possible	Moderate
Serializable	Prevented	Prevented	Prevented	Slowest

Most databases default to Read Committed. PostgreSQL uses Repeatable Read via MVCC (Multi-Version Concurrency Control), which provides each transaction a consistent snapshot of the database without locking reads.

Storage Engines

Under the hood, databases use storage engines that determine how data is physically stored and retrieved:

B-Tree Based (e.g., InnoDB, PostgreSQL)

Organizes data in a balanced tree structure. Each node contains keys and pointers to child nodes. Leaf nodes contain actual data or pointers to data pages. Excellent for read-heavy workloads and range queries. Updates are done in-place.

LSM-Tree Based (e.g., RocksDB, LevelDB, Cassandra)

Log-Structured Merge trees batch writes into in-memory buffers (memtables), then flush them to sorted files (SSTables) on disk. Writes are always sequential, making them extremely fast. Background compaction merges SSTables. Better for write-heavy workloads.

B-Tree

In-place updates
Fast random reads
Write amplification from page splits
Better read/write balance

LSM-Tree

Append-only writes
Higher write throughput
Read amplification from multiple levels
Better for write-heavy workloads

Key Takeaways

Relational databases provide strong guarantees (ACID) and are the right default for structured data with relationships.
NoSQL databases offer flexibility and horizontal scalability, each type optimized for specific access patterns.
Indexing is the most impactful optimization for read performance, but over-indexing hurts writes.
Understand your data access patterns before choosing a database. Read-heavy vs. write-heavy, simple lookups vs. complex queries, single-record vs. range scans.
Most production systems use multiple databases (polyglot persistence) to match each workload to its optimal store.

Chapter Check-Up

Quick quiz to reinforce what you just learned.

Database Fundamentals