Data Models and Query Languages | System Design Course

Why Data Models Matter

A data model determines the API through which your application interacts with its storage layer. It constrains the queries you can express efficiently, the relationships you can model naturally, and the patterns that become awkward or impossible. Choosing a data model is one of the most consequential architectural decisions you will make, because migrating between models later is extremely costly.

Most applications are built using one of three dominant data models: relational, document, or graph. Each emerged to solve a different class of problem, and each makes certain operations easy at the expense of others.

The Relational Model

The relational model, proposed by Edgar Codd in 1970, organizes data into relations (tables), where each relation is an unordered collection of tuples (rows). This model dominated business data processing for decades, and for good reason: it provides a clean abstraction that hides the complexity of the underlying storage engine.

Tables, Rows, and Joins

In the relational model, data is normalized: each fact is stored in exactly one place, and relationships between entities are expressed through foreign keys and joins.

SQL Schema ExampleCREATE TABLE users (
    user_id    SERIAL PRIMARY KEY,
    name       VARCHAR(100) NOT NULL,
    email      VARCHAR(255) UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE orders (
    order_id   SERIAL PRIMARY KEY,
    user_id    INT REFERENCES users(user_id),
    total      DECIMAL(10,2),
    status     VARCHAR(20),
    created_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE order_items (
    item_id    SERIAL PRIMARY KEY,
    order_id   INT REFERENCES orders(order_id),
    product_id INT REFERENCES products(product_id),
    quantity   INT,
    price      DECIMAL(10,2)
);

-- Join to get a user's order history with items:
SELECT u.name, o.order_id, o.total, oi.product_id, oi.quantity
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN order_items oi ON o.order_id = oi.order_id
WHERE u.user_id = 42;

Normalization

Normalization is the process of structuring data to eliminate redundancy. There are several normal forms, but the key principle is: each fact should be stored exactly once. If a user's city changes, you update one row in the users table rather than hunting through duplicated data across multiple tables.

1NF: Each column contains atomic values (no arrays or nested structures).
2NF: Every non-key column depends on the entire primary key, not just part of it.
3NF: No non-key column depends on another non-key column (no transitive dependencies).

The trade-off is that highly normalized data requires joins to reassemble, which can be expensive at scale. This tension between normalization and query performance is a recurring theme in system design.

SQL: The Relational Query Language

SQL is a declarative query language: you describe what data you want, and the query optimizer decides how to retrieve it. This is a powerful abstraction because it allows the database engine to change its execution strategy (e.g., choosing between a sequential scan and an index scan) without changing your query.

The Document Model

Document databases (MongoDB, CouchDB, Amazon DocumentDB) store data as self-contained documents, typically in JSON or BSON format. Each document contains all the data relevant to that entity, including nested objects and arrays.

Schema Flexibility

Document databases are often described as "schemaless," but a more accurate term is schema-on-read: the structure of the data is implicit, and only interpreted when the data is read. This contrasts with the relational model's schema-on-write, where the database enforces structure at write time.

Document Example (JSON){
  "_id": "user_42",
  "name": "Alice Chen",
  "email": "alice@example.com",
  "address": {
    "street": "123 Main St",
    "city": "San Francisco",
    "state": "CA",
    "zip": "94105"
  },
  "orders": [
    {
      "order_id": "ord_001",
      "total": 89.99,
      "status": "delivered",
      "items": [
        { "product": "Mechanical Keyboard", "qty": 1, "price": 89.99 }
      ]
    },
    {
      "order_id": "ord_002",
      "total": 245.00,
      "status": "shipped",
      "items": [
        { "product": "Monitor Stand", "qty": 1, "price": 45.00 },
        { "product": "USB-C Hub", "qty": 2, "price": 100.00 }
      ]
    }
  ]
}

Notice how a single document contains the user, their address, and all their orders with items. In a relational database, this would span four tables and require multiple joins to reassemble. The document model provides data locality: all the data you need is in one place, which can dramatically reduce the number of disk seeks for read-heavy access patterns.

When Documents Excel

One-to-many relationships: When an entity owns a tree of nested data (a user's orders, a blog post's comments), documents model this naturally.
Heterogeneous data: When different records have different fields (e.g., product catalogs where a laptop has CPU/RAM specs but a shirt has size/color).
Read-heavy workloads: When you usually load the entire entity at once, data locality avoids expensive joins.

When Documents Struggle

Many-to-many relationships: If orders reference shared products, or users reference shared organizations, you end up duplicating data or performing application-level joins.
Cross-document queries: Aggregations and joins across documents are limited and often require MapReduce or aggregation pipelines.
Deeply nested updates: Updating a specific item buried three levels deep in a nested array is awkward in most document databases.

The Graph Model

When your data is dominated by many-to-many relationships, neither the relational nor the document model is a natural fit. Graph databases (Neo4j, Amazon Neptune, TigerGraph) model data as vertices (nodes) connected by edges (relationships), making it easy to traverse complex webs of connections.

Property Graphs

In the property graph model, each vertex and edge can have an arbitrary set of key-value properties. This is the model used by Neo4j and most modern graph databases.

Property Graph StructureVertex (Node):
  - Unique ID
  - Set of outgoing edges
  - Set of incoming edges
  - Properties (key-value pairs)

Edge (Relationship):
  - Unique ID
  - Tail vertex (start node)
  - Head vertex (end node)
  - Label (relationship type)
  - Properties (key-value pairs)

Cypher Query Language

Cypher is a declarative query language for property graphs, used by Neo4j. It makes graph traversal queries concise and readable.

Cypher-- Find all friends-of-friends of Alice who live in New York:
MATCH (alice:Person {name: "Alice"})-[:FRIENDS_WITH]->()
      -[:FRIENDS_WITH]->(fof:Person)
WHERE fof.city = "New York" AND fof <> alice
RETURN DISTINCT fof.name;

-- Find the shortest path between two users:
MATCH path = shortestPath(
  (a:Person {name: "Alice"})-[:FRIENDS_WITH*]-(b:Person {name: "Bob"})
)
RETURN path;

Triple Stores and SPARQL

An alternative graph model is the triple store, where all information is stored as three-part statements: (subject, predicate, object). This model is used by RDF databases and is query-able using SPARQL.

SPARQL# Find all people who live in cities within the US:
SELECT ?person ?city WHERE {
  ?person :livesIn ?city .
  ?city   :locatedIn :UnitedStates .
}

Triple stores are common in knowledge graphs, semantic web applications, and situations where you need to integrate data from many heterogeneous sources under a unified schema.

Relational vs Document vs Graph: When to Use Which

Relational (PostgreSQL, MySQL)

Best for structured data with well-defined schemas.
Strong ACID transactions and referential integrity.
Powerful join operations for many-to-many relationships.
Mature ecosystem with decades of optimization.
Use when: e-commerce, banking, ERP, any OLTP workload with complex relationships and strict consistency.

Document (MongoDB, DynamoDB)

Best for semi-structured data with variable schemas.
Excellent data locality for entity-centric reads.
Scales horizontally with sharding out of the box.
Schema-on-read gives flexibility during rapid iteration.
Use when: content management, product catalogs, user profiles, mobile app backends, event logging.

Graph (Neo4j, Neptune)

Best for highly connected data with deep traversals.
Querying relationships is O(1) per hop, not O(n) like SQL joins.
Natural for recursive and variable-depth queries.
Weaker at bulk analytical processing compared to relational.
Use when: social networks, recommendation engines, fraud detection, knowledge graphs, network topology.

Decision Framework

Factor	Relational	Document	Graph
Primary data shape	Tabular, rows & columns	Nested JSON documents	Nodes & edges
Schema enforcement	Schema-on-write (strict)	Schema-on-read (flexible)	Schema-optional
Relationships	Foreign keys + joins	Embedded or references	First-class edges
Many-to-many support	Excellent (joins)	Poor (denormalization)	Excellent (traversals)
Query expressiveness	SQL (very expressive)	Query by example / MQL	Cypher / SPARQL / Gremlin
Horizontal scaling	Harder (sharding joins)	Native sharding	Challenging at scale
Transactions	Full ACID	Single-document ACID	Varies by product
Example use cases	Banking, ERP, e-commerce	CMS, catalogs, profiles	Social networks, fraud

Hybrid Approaches: The Best of Multiple Worlds

In practice, the boundaries between data models have blurred significantly. Modern databases increasingly support multiple models within a single system.

PostgreSQL JSONB

PostgreSQL added native JSONB support, allowing you to store semi-structured data alongside relational tables. You get the best of both worlds: ACID transactions, joins, and SQL for structured data, plus flexible JSON for fields that vary between records.

PostgreSQL JSONBCREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    name       VARCHAR(200) NOT NULL,
    category   VARCHAR(50),
    price      DECIMAL(10,2),
    attributes JSONB  -- flexible product-specific fields
);

-- Index into the JSON for fast queries:
CREATE INDEX idx_attributes_color
ON products USING GIN ((attributes->'color'));

-- Query combining relational and JSON:
SELECT name, price, attributes->>'color' AS color
FROM products
WHERE category = 'clothing'
  AND attributes @> '{"size": "L"}'
  AND price < 50.00;

Multi-Model Databases

ArangoDB: Supports document, graph, and key-value access in a single engine with a unified query language (AQL).
CosmosDB: Microsoft's globally distributed database offering document, graph, column-family, and key-value APIs over the same backend.
SurrealDB: A newer multi-model database supporting documents, graphs, and relational queries with built-in real-time capabilities.

Key Takeaway

There is no universally "best" data model. The right choice depends on your data's relationship structure, your query patterns, and your consistency requirements. Start by understanding the shape of your data: if it is tabular and well-structured, use relational. If it is hierarchical and entity-centric, use document. If it is a dense web of interconnections, use graph. And in many real systems, you will use more than one model for different parts of the application.

Query Language Paradigms

Declarative vs Imperative

SQL and Cypher are declarative: you state what result you want, and the database figures out how to compute it. This is advantageous because the query optimizer can automatically parallelize execution, choose optimal join strategies, and take advantage of indexes without changes to your queries.

In contrast, imperative approaches (like iterating over documents in application code to perform a join) give you more control but lose the optimization benefits. MapReduce sits between the two: it is neither fully declarative nor fully imperative.

MapReduce Querying

MapReduce, popularized by MongoDB's aggregation pipeline and Hadoop, processes large datasets by splitting work into map and reduce phases. While powerful for batch processing, it is being replaced by more declarative alternatives in most modern databases.

MongoDB Aggregation Pipeline// Count orders per city in the last 30 days:
db.orders.aggregate([
  { $match: { created_at: { $gte: new Date("2026-01-18") } } },
  { $lookup: {
      from: "users",
      localField: "user_id",
      foreignField: "_id",
      as: "user"
  }},
  { $unwind: "$user" },
  { $group: {
      _id: "$user.address.city",
      total_orders: { $sum: 1 },
      revenue: { $sum: "$total" }
  }},
  { $sort: { revenue: -1 } }
]);

Key Takeaways

The relational model excels at structured data with complex many-to-many relationships and strong consistency guarantees.
The document model provides data locality and schema flexibility, ideal for entity-centric access patterns with one-to-many relationships.
The graph model makes many-to-many relationships first-class citizens, enabling efficient traversals that would require expensive recursive joins in SQL.
Hybrid approaches like PostgreSQL JSONB and multi-model databases let you combine the strengths of multiple models in a single system.
Prefer declarative query languages (SQL, Cypher) over imperative data processing to let the database optimizer work for you.

Data Models & Query Languages