1.1.3. Batch vs Streaming: Two Paradigms for Data Movement
💡 First Principle: The choice between batch and streaming isn't about technology; it's about how quickly your business needs to act on the data. If the answer to "what happens if this data arrives an hour late?" is "nothing important," batch is your friend. If the answer is "we lose money every second," you need streaming.
This is arguably the most fundamental architectural decision in data engineering, and the exam tests it repeatedly from different angles. Understanding the trade-offs will help you answer correctly even on services you haven't memorized.
| Characteristic | Batch | Streaming |
|---|---|---|
| Latency | Minutes to hours | Milliseconds to seconds |
| Volume | Large, bounded datasets | Continuous, unbounded flow |
| Processing | Process all at once | Process as it arrives |
| Complexity | Simpler to build and debug | More complex (ordering, deduplication, late data) |
| Cost | Often lower (process during off-peak) | Often higher (always running) |
| AWS Services | S3, Glue, EMR, Lambda, Redshift COPY | Kinesis Data Streams, Kinesis Data Firehose, MSK, Managed Service for Apache Flink |
| Use Cases | Nightly ETL, historical analysis, reporting | Fraud detection, real-time dashboards, IoT alerts |
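The "Processing" row is easiest to see side by side. The following is a minimal Python sketch, not tied to any AWS service, that computes the same order total both ways; the function names and sample amounts are made up for illustration.

```python
from typing import Iterable, Iterator, List


def batch_total(orders: List[float]) -> float:
    """Batch: the dataset is bounded, so process all of it in one pass."""
    return sum(orders)


def streaming_totals(orders: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated result as each record arrives."""
    running = 0.0
    for amount in orders:
        running += amount
        yield running


orders = [20.0, 5.0, 12.5]             # hypothetical order amounts
print(batch_total(orders))             # 37.5
print(list(streaming_totals(orders)))  # [20.0, 25.0, 37.5]
```

The batch function gives one answer once all the data is in. The streaming function gives an answer after every record, which is what makes low latency possible and also what introduces the ordering and late-data concerns listed under "Complexity."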
There's also a middle ground, micro-batching, where data arrives continuously but is processed in small intervals (for example, every 1 to 5 minutes). Kinesis Data Firehose operates this way, buffering records before delivering them to S3 or Redshift. Many exam scenarios live in this middle ground, so watch the question stem for latency requirements that say "near-real-time" (hinting at Firehose) rather than "real-time" (hinting at Kinesis Data Streams or MSK).
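To make the buffering idea concrete, here is a toy micro-batcher in Python. It is a simplified model, not Firehose's actual implementation: the class name, the thresholds, and the explicit `now` timestamps are all assumptions chosen for illustration. It flushes when the buffer is full or when the buffer has been open too long, whichever comes first.

```python
from typing import List, Optional


class MicroBatcher:
    """Toy buffer that flushes on record count or buffer age, whichever comes first."""

    def __init__(self, max_records: int, max_age_seconds: float) -> None:
        self.max_records = max_records          # hypothetical size threshold
        self.max_age_seconds = max_age_seconds  # hypothetical interval threshold
        self._buffer: List[str] = []
        self._opened_at: Optional[float] = None

    def add(self, record: str, now: float) -> Optional[List[str]]:
        """Buffer a record; return the flushed batch if a threshold is reached."""
        if self._opened_at is None:
            self._opened_at = now  # the first record opens a new buffer window
        self._buffer.append(record)
        too_full = len(self._buffer) >= self.max_records
        too_old = now - self._opened_at >= self.max_age_seconds
        if too_full or too_old:
            batch, self._buffer, self._opened_at = self._buffer, [], None
            return batch
        return None


batcher = MicroBatcher(max_records=3, max_age_seconds=60)
arrivals = [("a", 0), ("b", 10), ("c", 20), ("d", 30), ("e", 95)]
for record, ts in arrivals:
    flushed = batcher.add(record, now=ts)
    if flushed is not None:
        print(ts, flushed)
# 20 ['a', 'b', 'c']   (size threshold reached)
# 95 ['d', 'e']        (age threshold reached)
```

Note that this sketch only checks the clock when a record arrives; a real delivery service would flush on a timer even if no new records show up. The point for the exam is that delivery latency is bounded by the buffer settings, which is why the result is "near-real-time" rather than "real-time."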
The exam also tests replayability: the ability to re-process historical data. Batch systems are inherently replayable because you can always re-read the source files. Streaming systems require intentional design: Kinesis Data Streams retains data for 24 hours by default (extendable up to 365 days), and MSK can retain data indefinitely using tiered storage. If a question mentions needing to "reprocess last week's data," that's a replayability signal.
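The retention arithmetic behind that signal can be sketched in a few lines of Python. The 24-hour and 365-day figures come from the paragraph above; the function name and the sample dates are assumptions for illustration, and no AWS API is called.

```python
from datetime import datetime, timedelta, timezone

KINESIS_DEFAULT_RETENTION = timedelta(hours=24)
KINESIS_MAX_RETENTION = timedelta(days=365)


def can_replay(oldest_needed: datetime, now: datetime, retention: timedelta) -> bool:
    """True if records as old as `oldest_needed` are still inside the retention window."""
    return now - oldest_needed <= retention


now = datetime(2024, 1, 8, tzinfo=timezone.utc)  # hypothetical "current" time
last_week = now - timedelta(days=7)

print(can_replay(last_week, now, KINESIS_DEFAULT_RETENTION))  # False
print(can_replay(last_week, now, KINESIS_MAX_RETENTION))      # True
```

With default settings, last week's records have already aged out of the stream, so the scenario needs either extended retention or a durable copy of the raw data (for example, in S3) to replay from.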
⚠️ Exam Trap: "Real-time" in an exam question doesn't always mean Kinesis Data Streams. If the scenario requires delivering data to S3 or Redshift in near-real-time with minimal code, Kinesis Data Firehose is often the better answer. Read the latency requirements carefully.
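One way to practice reading question stems is to write the reasoning down as a rule of thumb. The Python sketch below does that. It is a study heuristic only: the function name and both thresholds are illustrative assumptions, not AWS limits, and real scenarios involve more factors than these two inputs.

```python
# Illustrative thresholds (assumptions, not AWS limits).
REAL_TIME_MAX_SECONDS = 5          # "act within seconds"
NEAR_REAL_TIME_MAX_SECONDS = 300   # "within a few minutes"


def suggest_ingestion(max_latency_seconds: float, managed_delivery_to_store: bool) -> str:
    """Map a latency tolerance and a delivery need to a likely exam answer.

    managed_delivery_to_store: the scenario wants data landed in S3 or
    Redshift with minimal custom code.
    """
    if max_latency_seconds <= REAL_TIME_MAX_SECONDS:
        return "Kinesis Data Streams or MSK"
    if max_latency_seconds <= NEAR_REAL_TIME_MAX_SECONDS:
        if managed_delivery_to_store:
            return "Kinesis Data Firehose"
        return "Kinesis Data Streams or MSK"
    return "Batch (e.g., S3 + Glue)"


print(suggest_ingestion(1, managed_delivery_to_store=False))     # Kinesis Data Streams or MSK
print(suggest_ingestion(120, managed_delivery_to_store=True))    # Kinesis Data Firehose
print(suggest_ingestion(86400, managed_delivery_to_store=True))  # Batch (e.g., S3 + Glue)
```

The useful habit is the order of the checks: pin down the latency tolerance first, then ask whether the scenario values minimal code and managed delivery over custom consumers.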
Reflection Question: An e-commerce company processes order data nightly for inventory reports, but wants to detect fraudulent transactions within seconds. Which paradigm serves each use case? Can a single pipeline serve both?