2.2. Batch Data Ingestion
💡 First Principle: Batch ingestion trades latency for simplicity. Like scheduling a weekly grocery delivery instead of running to the store every time you need something, batch processing collects data over a period and processes it all at once: simpler to build, cheaper to operate, and sufficient for the vast majority of analytics workloads.
While streaming gets the headlines, batch processing does the heavy lifting in most data architectures. Nightly ETL jobs, weekly report generation, monthly data warehouse refreshes: these are batch patterns, and they dominate because most business decisions don't need sub-second data freshness. The question "how did we perform last quarter?" doesn't change if the answer arrives 5 minutes sooner.
Without well-designed batch ingestion, organizations resort to manual exports: someone running a query, saving a CSV, and uploading it to a shared drive. This breaks down at scale and introduces errors. Automated batch ingestion replaces human error with repeatable, auditable, fault-tolerant pipelines.
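The manual export-save-upload workflow is exactly what a pipeline automates. As a minimal sketch, the snippet below shows two pieces such a pipeline might contain: building a date-partitioned landing-zone key and serializing query results to CSV in memory. The `landing/` prefix, the Hive-style partition layout, and the `sales` dataset name are illustrative assumptions, not AWS requirements; the actual upload step (e.g. via an S3 client) is left out.

```python
import csv
import io
from datetime import date


def landing_key(dataset: str, run_date: date) -> str:
    """Build a date-partitioned object key for one batch run.

    The `landing/` prefix and year=/month=/day= layout are a common
    convention for S3 landing zones, chosen here for illustration.
    """
    return (
        f"landing/{dataset}/"
        f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}/"
        f"{dataset}_{run_date:%Y%m%d}.csv"
    )


def rows_to_csv(rows: list[dict]) -> str:
    """Serialize query results to CSV in memory, ready for upload."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Example: one nightly batch for a hypothetical "sales" dataset.
key = landing_key("sales", date(2024, 7, 15))
body = rows_to_csv([{"order_id": 1, "amount": 42.5}])
# An upload step would then send `body` to `key` in the landing bucket.
```

Because the key embeds the run date, every execution writes to a distinct, predictable location, which makes the pipeline repeatable and auditable in a way that ad-hoc CSV uploads never are.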
The exam tests batch ingestion heavily because it's the most common pattern in real-world data engineering. Expect scenarios involving S3 as a landing zone, Glue crawlers discovering schema, DMS migrating databases, and AppFlow pulling SaaS data.