2.3. Scheduling, Triggers, and Ingestion Patterns
💡 First Principle: A pipeline that runs manually isn't a pipeline; it's a script. Real data engineering requires automated scheduling and event-driven triggers that keep data flowing without human intervention. Think of it like a factory assembly line: the line runs on a schedule, but quality alerts can stop the line at any moment. Data pipelines need both rhythms: time-based ("run every hour") and event-based ("run when new data arrives").
Without proper scheduling, organizations depend on someone remembering to click "run," and when that person is on vacation, the pipeline doesn't run. Without event-driven triggers, data sits unprocessed in S3 for hours or days until the next scheduled run. The exam tests your ability to choose between scheduling patterns and implement them with the right AWS services.
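To make the event-driven pattern concrete, here is a minimal sketch of a Lambda-style handler that reacts to an S3 `ObjectCreated` notification, so data is processed the moment it lands rather than waiting for the next scheduled run. The `process_object` function and the return shape are illustrative assumptions, not a specific AWS API; only the S3 event record structure (`Records[].s3.bucket.name` / `Records[].s3.object.key`) follows the standard notification format.

```python
def process_object(bucket: str, key: str) -> None:
    """Placeholder for the real ingestion step, e.g. starting a Glue job
    or running a transform. Illustrative only."""
    print(f"processing s3://{bucket}/{key}")


def lambda_handler(event: dict, context=None) -> dict:
    """Entry point invoked by an S3 ObjectCreated notification.

    A single notification can batch multiple records, so iterate over
    all of them; handling only event["Records"][0] silently drops data.
    """
    processed = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        process_object(bucket, key)
        processed.append(f"s3://{bucket}/{key}")
    return {"statusCode": 200, "processed": processed}
```

The time-based counterpart needs no code change: an EventBridge rule with a schedule expression such as `rate(1 hour)` or a cron expression can invoke the same function, which is why keeping handlers trigger-agnostic is a common design choice.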
A critical concept the exam tests: idempotency. If a scheduled job runs twice due to a retry, does it produce correct results or corrupt data? Well-designed pipelines produce the same output regardless of how many times they run; this is idempotent execution. Glue job bookmarks, DynamoDB conditional writes, and S3 object versioning are all mechanisms that support idempotency.
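The idea above can be sketched in plain Python without any AWS calls. In this hypothetical example, the in-memory `store` dict stands in for a table or bucket: the job derives its output key deterministically from the input partition and fully overwrites the value rather than appending, so a retry leaves the store in exactly the same state.

```python
def run_batch(store: dict, partition: str, records: list) -> dict:
    """Idempotent batch job (sketch).

    The output key is derived from the input partition, and the value is
    recomputed from scratch and overwritten. Running the job once or ten
    times yields the same store contents.

    An appending version, e.g. store.setdefault(partition, []).extend(records),
    would duplicate data on every retry.
    """
    store[f"summary/{partition}"] = {
        "count": len(records),
        "total": sum(records),
    }
    return store


store: dict = {}
run_batch(store, "2024-01-01", [3, 5, 8])
run_batch(store, "2024-01-01", [3, 5, 8])  # retry: same state, no duplicates
```

This overwrite-by-deterministic-key pattern is the same principle behind partition-overwrite writes in S3 and conditional writes in DynamoDB: the retry either repeats harmlessly or is rejected, never double-counted.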