Database Fundamentals

The database is the heart of almost every system. Choosing the right database, designing the schema, and understanding how data is stored and retrieved are among the most consequential system design decisions you will make.

Relational Databases (SQL)

Relational databases store data in structured tables with rows and columns. Relationships between tables are expressed through foreign keys. They use SQL (Structured Query Language) for querying.

Examples: PostgreSQL, MySQL, Oracle, SQL Server, SQLite.

ACID Properties

Relational databases guarantee ACID transactions:

  • AtomicityA transaction either completes entirely or not at all. If any part fails, the entire transaction is rolled back.
  • ConsistencyA transaction moves the database from one valid state to another. All constraints (unique, foreign key, check) are enforced.
  • IsolationConcurrent transactions behave as if they were executed sequentially. Different isolation levels trade strictness for performance.
  • DurabilityOnce a transaction is committed, the data is permanently written, even if the system crashes immediately after.

Indexing

An index is a data structure (usually a B-tree or B+ tree) that speeds up read queries by avoiding full table scans. Without an index, the database must examine every row to find matches. With an index, it can jump directly to the relevant rows.

Trade-off
Indexes speed up reads but slow down writes. Every INSERT, UPDATE, or DELETE must also update the index. Over-indexing a table that handles heavy writes will degrade performance. Index only columns that appear frequently in WHERE clauses, JOIN conditions, or ORDER BY clauses.

Normalization

Normalization organizes data to reduce redundancy and improve integrity:

  • 1NF: Each column holds atomic values; no repeating groups.
  • 2NF: 1NF + every non-key column depends on the entire primary key.
  • 3NF: 2NF + no transitive dependencies (non-key columns depend only on the primary key).

Normalization reduces data duplication but can increase the number of JOINs needed for queries. In high-performance systems, deliberate denormalization, duplicating data across tables, is common to avoid expensive joins.

Non-Relational Databases (NoSQL)

NoSQL databases sacrifice some relational guarantees for flexibility, scalability, or performance. There are four primary categories:

NoSQL Database Categories
Key-Value Redis, DynamoDB Memcached Document MongoDB, CouchDB Firestore Column-Family Cassandra, HBase ScyllaDB Graph Neo4j, Neptune JanusGraph Simple lookups Session storage Caching layer Flexible schemas Content management User profiles Time-series data Write-heavy workloads IoT / analytics Relationships Social networks Recommendation

Key-Value Stores

The simplest NoSQL model. Data is stored as key-value pairs. Keys are unique identifiers; values can be anything (strings, JSON, binary). Operations are limited: GET, SET, DELETE. Extremely fast for simple lookups.

Use cases: Caching, session storage, feature flags, rate limiting counters.

Document Stores

Store data as documents (typically JSON or BSON). Each document can have a different structure. Documents are grouped into collections. Support rich queries within documents.

Use cases: Content management systems, user profiles, product catalogs, any domain where schema flexibility matters.

Column-Family Stores

Data is organized by columns rather than rows. Each row can have a different set of columns, and columns are grouped into families. Optimized for writing large volumes of data and for reading specific columns across many rows.

Use cases: Time-series data, IoT event logs, analytics, write-heavy workloads.

Graph Databases

Store data as nodes and edges (relationships). Queries traverse the graph efficiently, making them ideal for connected data. Finding the shortest path or all friends of friends is natural and fast.

Use cases: Social networks, fraud detection, recommendation engines, knowledge graphs.

SQL vs. NoSQL: When to Choose Which

Choose SQL When

  • Your data has clear relationships and a stable schema.
  • You need ACID transactions.
  • You need complex queries with JOINs, aggregations, subqueries.
  • Data integrity is critical (financial, medical).
  • Your read and write patterns are balanced.

Choose NoSQL When

  • Your schema evolves frequently or varies per record.
  • You need horizontal scalability above all else.
  • Your access patterns are simple (key lookups, document reads).
  • You handle massive write volumes.
  • Eventual consistency is acceptable for your use case.
In Practice
Most large systems use both SQL and NoSQL. For example, a user profile might live in PostgreSQL (relational, ACID), while the user's activity feed is stored in Cassandra (write-heavy, time-series), and session tokens are cached in Redis (key-value, fast). This is polyglot persistence.

Transactions and Isolation Levels

When multiple transactions run concurrently, isolation levels control how much they see of each other's uncommitted work:

Isolation LevelDirty ReadsNon-Repeatable ReadsPhantom ReadsPerformance
Read UncommittedPossiblePossiblePossibleFastest
Read CommittedPreventedPossiblePossibleFast
Repeatable ReadPreventedPreventedPossibleModerate
SerializablePreventedPreventedPreventedSlowest

Most databases default to Read Committed. PostgreSQL uses Repeatable Read via MVCC (Multi-Version Concurrency Control), which provides each transaction a consistent snapshot of the database without locking reads.

Storage Engines

Under the hood, databases use storage engines that determine how data is physically stored and retrieved:

B-Tree Based (e.g., InnoDB, PostgreSQL)

Organizes data in a balanced tree structure. Each node contains keys and pointers to child nodes. Leaf nodes contain actual data or pointers to data pages. Excellent for read-heavy workloads and range queries. Updates are done in-place.

LSM-Tree Based (e.g., RocksDB, LevelDB, Cassandra)

Log-Structured Merge trees batch writes into in-memory buffers (memtables), then flush them to sorted files (SSTables) on disk. Writes are always sequential, making them extremely fast. Background compaction merges SSTables. Better for write-heavy workloads.

B-Tree

  • In-place updates
  • Fast random reads
  • Write amplification from page splits
  • Better read/write balance

LSM-Tree

  • Append-only writes
  • Higher write throughput
  • Read amplification from multiple levels
  • Better for write-heavy workloads

Key Takeaways

  • Relational databases provide strong guarantees (ACID) and are the right default for structured data with relationships.
  • NoSQL databases offer flexibility and horizontal scalability, each type optimized for specific access patterns.
  • Indexing is the most impactful optimization for read performance, but over-indexing hurts writes.
  • Understand your data access patterns before choosing a database. Read-heavy vs. write-heavy, simple lookups vs. complex queries, single-record vs. range scans.
  • Most production systems use multiple databases (polyglot persistence) to match each workload to its optimal store.

Chapter Check-Up

Quick quiz to reinforce what you just learned.

๐Ÿงช

Practice What You Learned

Build a replication setup and observe how read replicas reduce primary database load.

Start Guided Lab โ†’