Two Paradigms of Data Processing
There are fundamentally two ways to process data at scale:
Batch Processing
- Processes a bounded dataset: a fixed, finite collection of data that is fully available before processing begins.
- Optimized for throughput: process as much data as possible, as efficiently as possible.
- Latency is measured in minutes to hours.
- Examples: daily report generation, training ML models, building search indexes, ETL pipelines.
- Key systems: MapReduce, Apache Spark, Apache Flink (batch mode), dbt.
Stream Processing
- Processes an unbounded dataset: events arrive continuously and are processed as they come.
- Optimized for latency: react to events as quickly as possible.
- Latency is measured in milliseconds to seconds.
- Examples: fraud detection, real-time dashboards, alerting, recommendation updates, CDC.
- Key systems: Apache Kafka, Apache Flink, Apache Kafka Streams, Apache Pulsar.
Batch Processing: MapReduce
MapReduce, introduced by Google in 2004, is the foundational batch processing model. It processes large datasets by distributing work across a cluster of commodity machines.
The MapReduce Programming Model
Every MapReduce job consists of two user-defined functions, connected by a framework-managed shuffle phase:
- Map: Takes an input record and emits zero or more key-value pairs. The mapper extracts the relevant information and assigns it a grouping key. Mappers run in parallel across input splits.
- Shuffle & Sort: The framework collects all values for the same key and sorts them. This is the most I/O-intensive phase and is handled automatically.
- Reduce: Takes a key and all its associated values, then produces the final output. Reducers also run in parallel, one per key partition.
// Map function: emits (word, 1) for each word in the document
function map(document):
    for each word in document:
        emit(word, 1)

// Reduce function: sums all counts for each word
function reduce(word, counts):
    total = sum(counts)
    emit(word, total)
// Input: "the cat sat on the mat"
// Map: (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
// Shuffle: {the: [1,1], cat: [1], sat: [1], on: [1], mat: [1]}
// Reduce: (the,2) (cat,1) (sat,1) (on,1) (mat,1)
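The pseudocode above runs as ordinary Python if the shuffle phase is simulated with a dictionary of lists (a toy single-machine sketch, not a distributed implementation):

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for each word, mirroring the map function above."""
    for word in document.split():
        yield (word, 1)

def shuffle(pairs):
    """Group all values by key, as the framework's shuffle & sort phase would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, counts):
    """Sum the counts for one word, mirroring the reduce function above."""
    return (word, sum(counts))

pairs = map_phase("the cat sat on the mat")
grouped = shuffle(pairs)
result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1}
```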
Sort-Merge Joins in MapReduce
Joining two large datasets is a common batch operation. MapReduce supports several join strategies:
- Reduce-side join (sort-merge join): Both datasets are mapped with the join key. The shuffle phase groups records by key. The reducer receives all records for the same key from both datasets and can perform the join. This is general-purpose but requires a full shuffle of both datasets.
- Map-side join (broadcast join): If one dataset is small enough to fit in memory, it is loaded into every mapper. Each mapper can join its input partition with the in-memory dataset without a shuffle. Much faster but limited by memory.
- Partitioned hash join: If both datasets are already partitioned by the join key, each mapper only needs the corresponding partition of the other dataset. This avoids a full shuffle while handling larger datasets than broadcast joins.
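The reduce-side join can be sketched in plain Python: each mapper tags records with their dataset of origin, the shuffle groups by join key, and the reducer pairs the two sides. The users/orders records are illustrative:

```python
from collections import defaultdict

# Two "datasets", keyed by user id.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (2, "mug")]

# Map phase: tag each record with its origin so the reducer can tell them apart.
mapped = [(uid, ("user", name)) for uid, name in users] + \
         [(uid, ("order", item)) for uid, item in orders]

# Shuffle: group by join key.
groups = defaultdict(list)
for key, record in mapped:
    groups[key].append(record)

# Reduce phase: within each key, pair every user record with every order record.
joined = []
for uid, records in groups.items():
    names = [v for tag, v in records if tag == "user"]
    items = [v for tag, v in records if tag == "order"]
    joined.extend((uid, n, i) for n in names for i in items)

print(sorted(joined))
# [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```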
Limitations of MapReduce
- Materialization between stages: Every MapReduce job writes its output to HDFS. Multi-stage pipelines read and write to disk between stages, creating enormous I/O overhead.
- No iteration: Algorithms that require multiple passes over the data (e.g., PageRank, k-means clustering) must chain separate MapReduce jobs, each writing and reading intermediate results.
- High latency: Job startup, scheduling, and disk I/O mean even simple jobs take minutes.
Modern Batch Processing
The limitations of MapReduce led to a new generation of batch processing frameworks that use in-memory computation and more flexible dataflow models.
Apache Spark
Spark introduces the concept of Resilient Distributed Datasets (RDDs): immutable, partitioned collections of records that can be cached in memory across the cluster. Key advantages:
- In-memory processing: Intermediate results are kept in memory, eliminating the disk I/O bottleneck of MapReduce. Spark can be 10-100x faster for iterative algorithms.
- DAG execution: Instead of rigid map-reduce stages, Spark builds a Directed Acyclic Graph (DAG) of transformations and optimizes the entire pipeline before execution.
- Rich API: Supports map, filter, join, group-by, window functions, and SQL queries through Spark SQL and DataFrames.
- Unified engine: Batch, streaming (Spark Structured Streaming), ML (MLlib), and graph processing (GraphX) on a single platform.
Apache Flink (Batch Mode)
Flink was designed as a stream processor first, but can also run batch jobs by treating bounded datasets as a special case of streams. This "stream-first" philosophy means Flink can handle both batch and streaming workloads with the same API and engine.
Dataflow Graphs
Modern batch frameworks represent computations as dataflow graphs where:
- Nodes represent operators (map, filter, join, aggregate).
- Edges represent data flow between operators.
- The framework optimizes the graph (operator fusion, partition pruning, predicate pushdown) before executing it.
This is a strict generalization of MapReduce: a MapReduce job is just a two-node dataflow graph.
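Operator fusion can be illustrated in a few lines of Python: instead of materializing the output of each operator, a chain of operators is collapsed into one lazy pass over the data. This is a conceptual sketch, not how any particular engine implements it:

```python
# Toy dataflow: each operator maps an iterator of records to another iterator.
def map_op(f):
    return lambda records: (f(r) for r in records)

def filter_op(pred):
    return lambda records: (r for r in records if pred(r))

def fuse(*ops):
    """Operator fusion: chain the operators lazily so records stream through
    all of them in a single pass, with no intermediate collection built."""
    def fused(records):
        for op in ops:
            records = op(records)
        return records
    return fused

pipeline = fuse(map_op(lambda x: x * 2), filter_op(lambda x: x > 4))
print(list(pipeline(range(5))))  # [6, 8]
```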
| Feature | MapReduce | Spark | Flink |
|---|---|---|---|
| Execution model | Disk-based, stage-by-stage | In-memory DAG | Streaming dataflow |
| Iteration support | Chained jobs (slow) | Native (cache RDDs) | Native (iterative dataflows) |
| Latency | Minutes | Seconds to minutes | Milliseconds to seconds |
| Streaming support | None | Micro-batch (Structured Streaming) | True streaming (event-at-a-time) |
| Fault tolerance | Re-run failed tasks (data on HDFS) | RDD lineage (recompute from source) | Checkpointing (snapshots) |
| Primary use case | Legacy ETL | General-purpose analytics | Real-time + batch unified |
Stream Processing: Event Logs & Kafka
Stream processing centers on the concept of an event log: an append-only, ordered sequence of records, each representing something that happened (a user clicked, a payment was made, a sensor reading was taken).
Apache Kafka Architecture
Kafka is the dominant event streaming platform. It is a distributed, partitioned, replicated commit log.
Key Kafka Concepts
- Topic: A named, logical feed of events. Topics are divided into partitions for parallelism. Example topics: orders, user-clicks, payment-events.
- Partition: An ordered, immutable, append-only log of records. Each record within a partition has a unique offset. Partitions are the unit of parallelism: more partitions means more consumers can read concurrently.
- Offset: A monotonically increasing integer assigned to each record in a partition. Consumers track their position by storing their current offset. This enables replay (re-reading from an earlier offset) and underpins the stronger delivery guarantees discussed later.
- Producer: Publishes records to topics. The producer chooses the partition for each record, either by key hash (records with the same key always go to the same partition) or round-robin.
- Consumer Group: A set of consumers that cooperatively read from a topic. Each partition is assigned to exactly one consumer in the group. If a consumer fails, its partitions are rebalanced to the remaining consumers. Multiple consumer groups can read the same topic independently.
- Broker: A Kafka server that stores partitions. A Kafka cluster consists of multiple brokers. Each partition has one leader broker (handles reads and writes) and zero or more follower brokers (replicas for fault tolerance).
- Retention: Kafka retains messages for a configurable duration (default: 7 days) or size. Unlike traditional message queues, consuming a message does not delete it. Multiple consumers can read the same data at different paces.
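A toy in-memory model (not the real Kafka client API) captures how these concepts fit together: key-hashed partitioning, per-partition offsets, and non-destructive reads:

```python
# Toy model of a topic: a list of partitions, each an append-only list.
# Records with the same key hash to the same partition.
class Topic:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        p = hash(key) % len(self.partitions)   # key hash -> stable partition
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset):
        """Consumers track offsets themselves; reading never deletes data."""
        return self.partitions[partition][offset:]

orders = Topic(num_partitions=3)
p, off = orders.produce("user-1", "order-a")
orders.produce("user-1", "order-b")            # same key -> same partition
print(orders.read(p, off))
# [('user-1', 'order-a'), ('user-1', 'order-b')]
```

Because consuming does not delete, a second consumer group could call read on the same partition from offset 0 and see the full history.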
Event Sourcing & CQRS
Event sourcing is an architectural pattern that stores the full history of state changes as an immutable log of events, rather than storing only the current state.
Event Sourcing
Instead of updating a row in a database (UPDATE account SET balance = 90 WHERE id = 1), you append an event to the log:
{ "event": "AccountCreated", "id": 1, "balance": 0, "timestamp": "..." }
{ "event": "MoneyDeposited", "id": 1, "amount": 100, "timestamp": "..." }
{ "event": "MoneyWithdrawn", "id": 1, "amount": 10, "timestamp": "..." }
// Current state: balance = 90 (derived by replaying events)
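Deriving the current state is a fold over the log. A minimal sketch in Python, using the event shapes from the example above:

```python
# The event log from the example (timestamps omitted for brevity).
events = [
    {"event": "AccountCreated", "id": 1, "balance": 0},
    {"event": "MoneyDeposited", "id": 1, "amount": 100},
    {"event": "MoneyWithdrawn", "id": 1, "amount": 10},
]

def apply(state, event):
    """Fold one event into the current balance."""
    if event["event"] == "AccountCreated":
        return event["balance"]
    if event["event"] == "MoneyDeposited":
        return state + event["amount"]
    if event["event"] == "MoneyWithdrawn":
        return state - event["amount"]
    return state

balance = 0
for e in events:
    balance = apply(balance, e)
print(balance)  # 90
```

Replaying up to any prefix of the log reconstructs the state at that point in time, which is exactly the temporal-query property listed below.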
Advantages of event sourcing:
- Complete audit trail: Every change is recorded. You can answer "how did we get here?" for any piece of data.
- Temporal queries: You can reconstruct the state at any point in time by replaying events up to that moment.
- Decoupled consumers: Different services can derive different views (projections) from the same event stream.
- Bug fixing: If a bug corrupted a derived view, you can fix the code and replay events to rebuild the correct state.
CQRS (Command Query Responsibility Segregation)
CQRS separates the write model (commands) from the read model (queries). This pairs naturally with event sourcing:
- Write side: Accepts commands, validates them, and appends events to the event log.
- Read side: Consumes events and builds optimized read models (materialized views) tailored to specific query patterns.
- The read and write models can use different databases, different schemas, and scale independently.
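A minimal CQRS sketch (illustrative, not a framework): a command handler validates and appends events; a projection consumes those events into a query-optimized view. The function and field names are assumptions for the example:

```python
event_log = []

def handle_deposit(account_id, amount):
    """Write side: validate the command, then append an event."""
    if amount <= 0:
        raise ValueError("deposit must be positive")
    event_log.append({"event": "MoneyDeposited", "id": account_id, "amount": amount})

balances = {}   # read side: a materialized view tailored to balance lookups

def project(event):
    """Read side: fold one event into the view."""
    if event["event"] == "MoneyDeposited":
        balances[event["id"]] = balances.get(event["id"], 0) + event["amount"]

handle_deposit(1, 100)
handle_deposit(1, 50)
for e in event_log:     # the read side catches up asynchronously in practice
    project(e)
print(balances)  # {1: 150}
```

Because the view is derived, it can be dropped and rebuilt from the log at any time, or replaced by a differently shaped view for a different query pattern.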
Change Data Capture (CDC)
CDC is the process of capturing row-level changes (inserts, updates, deletes) from a database and streaming them as events to downstream consumers. It bridges the gap between traditional databases and event-driven architectures.
How CDC Works
- Log-based CDC (preferred): Reads the database's write-ahead log (WAL/binlog/redo log) to capture every committed change. This is non-invasive (no application code changes, minimal performance impact) and captures all changes, including those made by direct SQL.
- Trigger-based CDC: Database triggers write changes to a shadow table. More portable but adds write overhead and complexity.
- Polling-based CDC: Periodically queries for rows with a modified timestamp greater than the last poll. Simple but misses deletes, has higher latency, and creates query load.
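Polling-based CDC, the simplest of the three, can be sketched against an in-memory SQLite table (the schema is illustrative). Note the weakness mentioned above: a deleted row simply stops matching the query and is never captured:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 'alice', 100)")
conn.execute("INSERT INTO users VALUES (2, 'bob', 200)")

def poll_changes(conn, since):
    """Fetch rows modified after the last poll position and advance it."""
    rows = conn.execute(
        "SELECT id, name, updated_at FROM users WHERE updated_at > ? ORDER BY updated_at",
        (since,),
    ).fetchall()
    new_since = rows[-1][2] if rows else since
    return rows, new_since

last_poll = 0
changes, last_poll = poll_changes(conn, last_poll)
print(changes)   # [(1, 'alice', 100), (2, 'bob', 200)]

conn.execute("UPDATE users SET name = 'bobby', updated_at = 300 WHERE id = 2")
changes, last_poll = poll_changes(conn, last_poll)
print(changes)   # [(2, 'bobby', 300)]
```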
Tools: Debezium (open source, log-based CDC for Postgres, MySQL, MongoDB, etc.) streams changes to Kafka topics. AWS DMS and GCP Datastream provide managed CDC services.
CDC Use Cases
- Keeping a search index (Elasticsearch) in sync with a primary database.
- Feeding a data warehouse or data lake with real-time data.
- Maintaining materialized views in a different database.
- Triggering microservice workflows when specific data changes.
Stream Joins, Windowing & Watermarks
Stream processing goes beyond simple per-event transformations. Real applications require aggregations, joins, and time-based operations on unbounded data.
Stream Joins
- Stream-Stream Join: Joins two event streams within a time window. Example: join click events with impression events within a 1-hour window to calculate click-through rates. Both sides are buffered for the window duration.
- Stream-Table Join (Enrichment): Enriches each event in a stream with data from a table (or a compacted stream that represents a table). Example: join order events with a customer table to add customer names. The table side is maintained as a local state store.
- Table-Table Join: Maintains a continuously updated join of two tables (represented as compacted streams). Used for materialized views that need to reflect the latest state of both tables.
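The stream-table join is the easiest to sketch: the table side is a local state store kept current from a changelog (latest value per key wins), and each arriving event is enriched against it. Names and fields here are illustrative:

```python
customer_table = {}   # local state store: customer_id -> name

def on_customer_change(customer_id, name):
    """Apply one record from the customers changelog (latest value wins)."""
    customer_table[customer_id] = name

def enrich_order(order):
    """Join one order event against the current state of the table."""
    return {**order, "customer_name": customer_table.get(order["customer_id"], "unknown")}

on_customer_change(42, "alice")
print(enrich_order({"order_id": 1, "customer_id": 42}))
# {'order_id': 1, 'customer_id': 42, 'customer_name': 'alice'}
print(enrich_order({"order_id": 2, "customer_id": 7}))
# {'order_id': 2, 'customer_id': 7, 'customer_name': 'unknown'}
```

A real engine must also decide what to do when the order arrives before the matching customer record; here the join simply misses, as the "unknown" case shows.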
Windowing
Windowing groups unbounded events into finite buckets for aggregation:
- Tumbling window: Fixed-size, non-overlapping windows. Example: count events every 5 minutes. Each event belongs to exactly one window.
- Hopping window: Fixed-size, overlapping windows. Example: 10-minute windows that advance every 5 minutes. An event may fall in multiple windows.
- Sliding window: Contains all events within a time range of each other. Used for "events in the last 10 minutes" queries.
- Session window: Groups events by activity sessions. A new window starts when there is a gap larger than the timeout. Used for user session analysis.
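Window assignment for the first two types reduces to simple arithmetic on the event timestamp; the helpers below are a sketch with integer timestamps:

```python
def tumbling_window(ts, size):
    """Map an event timestamp to the start of its single tumbling window."""
    return ts - ts % size

def hopping_windows(ts, size, hop):
    """All hopping windows (identified by start time) that contain ts."""
    first = tumbling_window(ts, hop)
    return [w for w in range(first - size + hop, first + hop, hop)
            if w >= 0 and w <= ts < w + size]

# An event at t=7 with 5-unit tumbling windows belongs only to window [5, 10).
print(tumbling_window(7, 5))      # 5
# With 10-unit windows hopping every 5 units, t=7 falls in [0,10) and [5,15).
print(hopping_windows(7, 10, 5))  # [0, 5]
```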
Watermarks
In stream processing, events can arrive out of order. An event with timestamp 10:01 might arrive after an event with timestamp 10:03 due to network delays. Watermarks solve this problem:
- A watermark is a timestamp W carried through the stream that asserts: "no more events with timestamp ≤ W are expected to arrive." It is the system's estimate of how complete the data is up to time W.
- When the watermark advances past the end of a window, the window can be closed and results emitted.
- Events that arrive after the watermark has passed their window are called late events. Systems handle these by either dropping them, emitting corrections, or using an allowed lateness threshold.
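The interaction of windows, watermarks, and late events can be sketched in a few lines (a toy model with integer timestamps and tumbling windows; real engines also checkpoint this state):

```python
WINDOW = 10
open_windows = {}          # window start -> buffered events
results = {}               # window start -> emitted count
late_events = []
watermark = float("-inf")

def on_event(ts, value):
    global watermark
    start = ts - ts % WINDOW
    if start + WINDOW <= watermark:      # the window was already closed
        late_events.append((ts, value))  # here: drop into a late-event list
        return
    open_windows.setdefault(start, []).append(value)

def on_watermark(w):
    """Advance the watermark and emit every window ending at or before it."""
    global watermark
    watermark = w
    for start in [s for s in open_windows if s + WINDOW <= w]:
        results[start] = len(open_windows.pop(start))

on_event(3, "a"); on_event(8, "b"); on_event(12, "c")
on_watermark(10)           # closes window [0, 10) with two events
on_event(5, "d")           # timestamp 5 is behind the watermark: late
print(results)             # {0: 2}
print(late_events)         # [(5, 'd')]
```

Dropping is only one policy; the same hook could instead re-open the window and emit a correction, or tolerate lateness up to an allowed threshold.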
Exactly-Once Semantics
Processing guarantees in stream systems come in three levels:
- At-most-once: Events may be lost but are never processed twice. Simplest to implement (fire and forget).
- At-least-once: Every event is processed, but some may be processed more than once (on retry after failure). Requires idempotent consumers or deduplication.
- Exactly-once: Every event is processed exactly one time, even in the presence of failures. The hardest guarantee to provide.
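In practice, at-least-once delivery plus an idempotent consumer often stands in for exactly-once: redeliveries are detected by a stable event id and skipped, so the effect is applied once. A minimal sketch (the dedup set would need durable storage in a real system):

```python
processed_ids = set()
total = 0

def handle(event):
    """Apply an event's effect once, even if it is delivered twice."""
    global total
    if event["id"] in processed_ids:
        return                       # duplicate delivery: ignore
    processed_ids.add(event["id"])
    total += event["amount"]

deliveries = [
    {"id": "e1", "amount": 10},
    {"id": "e2", "amount": 5},
    {"id": "e1", "amount": 10},      # redelivered after a failure
]
for e in deliveries:
    handle(e)
print(total)  # 15
```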
How Kafka achieves exactly-once:
- Idempotent producer: Each producer is assigned a PID (producer ID). Each message is tagged with a sequence number. The broker deduplicates messages with the same PID and sequence number, preventing duplicate writes even if the producer retries.
- Transactional producer: Allows atomic writes across multiple partitions. Either all messages in a transaction are committed, or none are. Consumers configured with read_committed isolation only see committed messages.
- Consumer offset commits: By atomically committing consumer offsets and output messages in the same transaction, the system guarantees that processing and offset advancement happen together.
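The idempotent-producer mechanism can be modeled as a broker that accepts a message only when its sequence number is exactly one past the last accepted for that producer id. This is a simplified toy (real Kafka tracks sequences per partition, among other details):

```python
class Broker:
    def __init__(self):
        self.log = []
        self.last_seq = {}            # producer id -> last accepted sequence

    def append(self, pid, seq, msg):
        expected = self.last_seq.get(pid, -1) + 1
        if seq < expected:
            return "duplicate"        # retry of an already-written message
        if seq > expected:
            return "out_of_sequence"  # a gap: an earlier message went missing
        self.last_seq[pid] = seq
        self.log.append(msg)
        return "ok"

b = Broker()
print(b.append("p1", 0, "a"))  # ok
print(b.append("p1", 1, "b"))  # ok
print(b.append("p1", 1, "b"))  # duplicate (producer retried after a timeout)
print(b.append("p1", 3, "d"))  # out_of_sequence
print(b.log)                   # ['a', 'b']
```

The duplicate case is the crucial one: the producer's retry is acknowledged without writing a second copy, which is what makes retries safe.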
Key Takeaways
- Batch processing operates on bounded datasets for throughput; stream processing operates on unbounded datasets for low latency. Modern frameworks like Flink unify both.
- MapReduce is the foundational batch model but suffers from disk I/O overhead. Spark and Flink improve on it with in-memory DAG execution.
- Kafka is a distributed commit log with topics, partitions, and consumer groups. It decouples producers from consumers and enables replay.
- Event sourcing stores all changes as an immutable event log. CQRS separates read and write models. Both add power but also complexity.
- CDC bridges traditional databases and event streaming by capturing row-level changes from the database log.
- Windowing and watermarks are essential for aggregating and ordering events in stream processing. Late events require explicit handling strategies.
- Exactly-once semantics in Kafka rely on idempotent producers, transactions, and atomic offset commits, but end-to-end exactly-once requires application-level design.