Why Data Models Matter
A data model determines the API through which your application interacts with its storage layer. It constrains the queries you can express efficiently, the relationships you can model naturally, and the patterns that become awkward or impossible. Choosing a data model is one of the most consequential architectural decisions you will make, because migrating between models later is extremely costly.
Most applications are built using one of three dominant data models: relational, document, or graph. Each emerged to solve a different class of problem, and each makes certain operations easy at the expense of others.
The Relational Model
The relational model, proposed by Edgar Codd in 1970, organizes data into relations (tables), where each relation is an unordered collection of tuples (rows). This model dominated business data processing for decades, and for good reason: it provides a clean abstraction that hides the complexity of the underlying storage engine.
Tables, Rows, and Joins
In the relational model, data is normalized: each fact is stored in exactly one place, and relationships between entities are expressed through foreign keys and joins.
CREATE TABLE users (
user_id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
email VARCHAR(255) UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
user_id INT REFERENCES users(user_id),
total DECIMAL(10,2),
status VARCHAR(20),
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE order_items (
item_id SERIAL PRIMARY KEY,
order_id INT REFERENCES orders(order_id),
product_id INT REFERENCES products(product_id),
quantity INT,
price DECIMAL(10,2)
);
-- Join to get a user's order history with items:
SELECT u.name, o.order_id, o.total, oi.product_id, oi.quantity
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE u.user_id = 42;Normalization
Normalization is the process of structuring data to eliminate redundancy. There are several normal forms, but the key principle is: each fact should be stored exactly once. If a user's city changes, you update one row in the users table rather than hunting through duplicated data across multiple tables.
- 1NF: Each column contains atomic values (no arrays or nested structures).
- 2NF: Every non-key column depends on the entire primary key, not just part of it.
- 3NF: No non-key column depends on another non-key column (no transitive dependencies).
The trade-off is that highly normalized data requires joins to reassemble, which can be expensive at scale. This tension between normalization and query performance is a recurring theme in system design.
SQL: The Relational Query Language
SQL is a declarative query language: you describe what data you want, and the query optimizer decides how to retrieve it. This is a powerful abstraction because it allows the database engine to change its execution strategy (e.g., choosing between a sequential scan and an index scan) without changing your query.
The Document Model
Document databases (MongoDB, CouchDB, Amazon DocumentDB) store data as self-contained documents, typically in JSON or BSON format. Each document contains all the data relevant to that entity, including nested objects and arrays.
Schema Flexibility
Document databases are often described as "schemaless," but a more accurate term is schema-on-read: the structure of the data is implicit, and only interpreted when the data is read. This contrasts with the relational model's schema-on-write, where the database enforces structure at write time.
{
"_id": "user_42",
"name": "Alice Chen",
"email": "alice@example.com",
"address": {
"street": "123 Main St",
"city": "San Francisco",
"state": "CA",
"zip": "94105"
},
"orders": [
{
"order_id": "ord_001",
"total": 89.99,
"status": "delivered",
"items": [
{ "product": "Mechanical Keyboard", "qty": 1, "price": 89.99 }
]
},
{
"order_id": "ord_002",
"total": 245.00,
"status": "shipped",
"items": [
{ "product": "Monitor Stand", "qty": 1, "price": 45.00 },
{ "product": "USB-C Hub", "qty": 2, "price": 100.00 }
]
}
]
}Notice how a single document contains the user, their address, and all their orders with items. In a relational database, this would span four tables and require multiple joins to reassemble. The document model provides data locality: all the data you need is in one place, which can dramatically reduce the number of disk seeks for read-heavy access patterns.
When Documents Excel
- One-to-many relationships: When an entity owns a tree of nested data (a user's orders, a blog post's comments), documents model this naturally.
- Heterogeneous data: When different records have different fields (e.g., product catalogs where a laptop has CPU/RAM specs but a shirt has size/color).
- Read-heavy workloads: When you usually load the entire entity at once, data locality avoids expensive joins.
When Documents Struggle
- Many-to-many relationships: If orders reference shared products, or users reference shared organizations, you end up duplicating data or performing application-level joins.
- Cross-document queries: Aggregations and joins across documents are limited and often require MapReduce or aggregation pipelines.
- Deeply nested updates: Updating a specific item buried three levels deep in a nested array is awkward in most document databases.
The Graph Model
When your data is dominated by many-to-many relationships, neither the relational nor the document model is a natural fit. Graph databases (Neo4j, Amazon Neptune, TigerGraph) model data as vertices (nodes) connected by edges (relationships), making it easy to traverse complex webs of connections.
Property Graphs
In the property graph model, each vertex and edge can have an arbitrary set of key-value properties. This is the model used by Neo4j and most modern graph databases.
Vertex (Node):
- Unique ID
- Set of outgoing edges
- Set of incoming edges
- Properties (key-value pairs)
Edge (Relationship):
- Unique ID
- Tail vertex (start node)
- Head vertex (end node)
- Label (relationship type)
- Properties (key-value pairs)Cypher Query Language
Cypher is a declarative query language for property graphs, used by Neo4j. It makes graph traversal queries concise and readable.
-- Find all friends-of-friends of Alice who live in New York:
MATCH (alice:Person {name: "Alice"})-[:FRIENDS_WITH]->()
-[:FRIENDS_WITH]->(fof:Person)
WHERE fof.city = "New York" AND fof <> alice
RETURN DISTINCT fof.name;
-- Find the shortest path between two users:
MATCH path = shortestPath(
(a:Person {name: "Alice"})-[:FRIENDS_WITH*]-(b:Person {name: "Bob"})
)
RETURN path;Triple Stores and SPARQL
An alternative graph model is the triple store, where all information is stored as three-part statements: (subject, predicate, object). This model is used by RDF databases and is query-able using SPARQL.
# Find all people who live in cities within the US:
SELECT ?person ?city WHERE {
?person :livesIn ?city .
?city :locatedIn :UnitedStates .
}Triple stores are common in knowledge graphs, semantic web applications, and situations where you need to integrate data from many heterogeneous sources under a unified schema.
Relational vs Document vs Graph: When to Use Which
Relational (PostgreSQL, MySQL)
- Best for structured data with well-defined schemas.
- Strong ACID transactions and referential integrity.
- Powerful join operations for many-to-many relationships.
- Mature ecosystem with decades of optimization.
- Use when: e-commerce, banking, ERP, any OLTP workload with complex relationships and strict consistency.
Document (MongoDB, DynamoDB)
- Best for semi-structured data with variable schemas.
- Excellent data locality for entity-centric reads.
- Scales horizontally with sharding out of the box.
- Schema-on-read gives flexibility during rapid iteration.
- Use when: content management, product catalogs, user profiles, mobile app backends, event logging.
Graph (Neo4j, Neptune)
- Best for highly connected data with deep traversals.
- Querying relationships is O(1) per hop, not O(n) like SQL joins.
- Natural for recursive and variable-depth queries.
- Weaker at bulk analytical processing compared to relational.
- Use when: social networks, recommendation engines, fraud detection, knowledge graphs, network topology.
Decision Framework
| Factor | Relational | Document | Graph |
|---|---|---|---|
| Primary data shape | Tabular, rows & columns | Nested JSON documents | Nodes & edges |
| Schema enforcement | Schema-on-write (strict) | Schema-on-read (flexible) | Schema-optional |
| Relationships | Foreign keys + joins | Embedded or references | First-class edges |
| Many-to-many support | Excellent (joins) | Poor (denormalization) | Excellent (traversals) |
| Query expressiveness | SQL (very expressive) | Query by example / MQL | Cypher / SPARQL / Gremlin |
| Horizontal scaling | Harder (sharding joins) | Native sharding | Challenging at scale |
| Transactions | Full ACID | Single-document ACID | Varies by product |
| Example use cases | Banking, ERP, e-commerce | CMS, catalogs, profiles | Social networks, fraud |
Hybrid Approaches: The Best of Multiple Worlds
In practice, the boundaries between data models have blurred significantly. Modern databases increasingly support multiple models within a single system.
PostgreSQL JSONB
PostgreSQL added native JSONB support, allowing you to store semi-structured data alongside relational tables. You get the best of both worlds: ACID transactions, joins, and SQL for structured data, plus flexible JSON for fields that vary between records.
CREATE TABLE products (
product_id SERIAL PRIMARY KEY,
name VARCHAR(200) NOT NULL,
category VARCHAR(50),
price DECIMAL(10,2),
attributes JSONB -- flexible product-specific fields
);
-- Index into the JSON for fast queries:
CREATE INDEX idx_attributes_color
ON products USING GIN ((attributes->'color'));
-- Query combining relational and JSON:
SELECT name, price, attributes->>'color' AS color
FROM products
WHERE category = 'clothing'
AND attributes @> '{"size": "L"}'
AND price < 50.00;Multi-Model Databases
- ArangoDB: Supports document, graph, and key-value access in a single engine with a unified query language (AQL).
- CosmosDB: Microsoft's globally distributed database offering document, graph, column-family, and key-value APIs over the same backend.
- SurrealDB: A newer multi-model database supporting documents, graphs, and relational queries with built-in real-time capabilities.
Key Takeaway
There is no universally "best" data model. The right choice depends on your data's relationship structure, your query patterns, and your consistency requirements. Start by understanding the shape of your data: if it is tabular and well-structured, use relational. If it is hierarchical and entity-centric, use document. If it is a dense web of interconnections, use graph. And in many real systems, you will use more than one model for different parts of the application.
Query Language Paradigms
Declarative vs Imperative
SQL and Cypher are declarative: you state what result you want, and the database figures out how to compute it. This is advantageous because the query optimizer can automatically parallelize execution, choose optimal join strategies, and take advantage of indexes without changes to your queries.
In contrast, imperative approaches (like iterating over documents in application code to perform a join) give you more control but lose the optimization benefits. MapReduce sits between the two: it is neither fully declarative nor fully imperative.
MapReduce Querying
MapReduce, popularized by MongoDB's aggregation pipeline and Hadoop, processes large datasets by splitting work into map and reduce phases. While powerful for batch processing, it is being replaced by more declarative alternatives in most modern databases.
// Count orders per city in the last 30 days:
db.orders.aggregate([
{ $match: { created_at: { $gte: new Date("2026-01-18") } } },
{ $lookup: {
from: "users",
localField: "user_id",
foreignField: "_id",
as: "user"
}},
{ $unwind: "$user" },
{ $group: {
_id: "$user.address.city",
total_orders: { $sum: 1 },
revenue: { $sum: "$total" }
}},
{ $sort: { revenue: -1 } }
]);Key Takeaways
- The relational model excels at structured data with complex many-to-many relationships and strong consistency guarantees.
- The document model provides data locality and schema flexibility, ideal for entity-centric access patterns with one-to-many relationships.
- The graph model makes many-to-many relationships first-class citizens, enabling efficient traversals that would require expensive recursive joins in SQL.
- Hybrid approaches like PostgreSQL JSONB and multi-model databases let you combine the strengths of multiple models in a single system.
- Prefer declarative query languages (SQL, Cypher) over imperative data processing to let the database optimizer work for you.