2.6. Pipeline Orchestration
First Principle: Consider a nightly pipeline with 15 dependent jobs: without orchestration, a single upstream failure cascades silently, and downstream dashboards show stale data for hours before anyone notices. Orchestration is the difference between a collection of scripts and a reliable data pipeline. Like a conductor coordinating dozens of musicians who each play different instruments at different times, an orchestrator coordinates pipeline tasks, ensuring dependencies are respected, failures trigger retries or alerts, and the whole system runs without human babysitting.
Without orchestration, a five-step pipeline (extract → clean → join → aggregate → load) requires manual sequencing. If step 2 fails, nobody notices until step 5 produces wrong results hours later. With orchestration, a failure in step 2 immediately halts the downstream steps, the failed step is retried, and the engineering team is alerted if the retries are exhausted. That visibility and control is what makes a pipeline production-grade.
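The behavior described above (run steps in order, retry a failed step, alert and stop when retries run out) can be sketched in a few lines of plain Python. This is a minimal illustration, not how any AWS service is implemented; `run_pipeline`, `PipelineFailed`, and the `alert` callback are hypothetical names introduced here.

```python
import time
from typing import Callable, List, Tuple


class PipelineFailed(Exception):
    """Raised when a step still fails after all retries (hypothetical name)."""


def run_pipeline(
    steps: List[Tuple[str, Callable[[], None]]],
    max_retries: int = 2,
    alert: Callable[[str], None] = print,
    delay_seconds: float = 0.0,
) -> List[str]:
    """Run steps in order; retry a failing step, then alert and stop."""
    completed: List[str] = []
    for name, task in steps:
        # One initial attempt plus max_retries retries.
        for attempt in range(1, max_retries + 2):
            try:
                task()
                completed.append(name)
                break
            except Exception as exc:
                if attempt > max_retries:
                    # Retries exhausted: notify someone and halt downstream steps.
                    alert(f"step '{name}' failed after {attempt} attempts: {exc}")
                    raise PipelineFailed(name) from exc
                time.sleep(delay_seconds)
    return completed
```

If the second step of a five-step list keeps raising, the function calls `alert` once and raises `PipelineFailed`, so steps three through five never run. A real orchestrator adds scheduling, state persistence, and parallelism on top of this basic loop.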
The exam tests three orchestration approaches: Step Functions (AWS-native state machines), MWAA (managed Apache Airflow), and Glue Workflows (Glue-native). The choice depends on complexity, ecosystem integration, and team familiarity. How do you decide? If the pipeline is Glue-only, use Glue Workflows. If it coordinates multiple AWS services with branching logic, use Step Functions. If it has complex dependencies, external integrations, or the team knows Airflow, use MWAA.
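As a memory aid, the decision rule above can be written as a small function. This is only a sketch: `choose_orchestrator` and its parameter names are hypothetical, and checking the rules in the order the text lists them (first match wins) is an assumption, since the text does not say which rule takes priority when several apply.

```python
from typing import Optional


def choose_orchestrator(
    glue_only: bool = False,
    multi_service_branching: bool = False,
    complex_dependencies: bool = False,
    external_integrations: bool = False,
    team_knows_airflow: bool = False,
) -> Optional[str]:
    """Apply the three rules in the order given in the text (assumed priority)."""
    if glue_only:
        return "Glue Workflows"
    if multi_service_branching:
        return "Step Functions"
    if complex_dependencies or external_integrations or team_knows_airflow:
        return "MWAA"
    # None of the stated rules applies; the text gives no default.
    return None
```

For example, `choose_orchestrator(multi_service_branching=True)` returns `"Step Functions"`, and `choose_orchestrator(team_knows_airflow=True)` returns `"MWAA"`.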