4.3.2. SageMaker Pipelines and Workflow Orchestration
SageMaker Pipelines is the exam's primary ML workflow tool. Each pipeline consists of steps — ProcessingStep for data preparation, TrainingStep for model training, ConditionStep for branching logic (e.g., deploy only if accuracy > 0.9), and RegisterModel for pushing to the Model Registry. Step caching is a crucial optimization: if a step's inputs haven't changed, Pipelines reuses the previous output instead of re-executing. This saves hours on expensive training steps during iterative development. Parameterization makes pipelines reusable — define variables like instance type, training data path, and hyperparameters as pipeline parameters rather than hardcoding values. The exam tests whether you can design a pipeline that handles retraining triggers, quality gates, and conditional deployment in a single workflow.
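The quality gate and parameterization described above can be sketched in plain Python. This is a simplified model of the control flow, not the SageMaker SDK (a real pipeline would use `ParameterString`/`ParameterFloat`, a `ConditionStep` with `ConditionGreaterThan`, and a `RegisterModel`/`ModelStep`); the parameter names and the 0.9 threshold mirror the example in the text:

```python
# Simplified model of a parameterized pipeline with a quality gate.
# Illustrative only -- NOT the SageMaker SDK.

DEFAULT_PARAMS = {
    "instance_type": "ml.m5.xlarge",    # overridable per execution, not hardcoded
    "train_data": "s3://bucket/train/", # illustrative path
    "accuracy_threshold": 0.9,
}

def run_pipeline(evaluate, deploy, **overrides):
    """Train/evaluate, then deploy only if accuracy clears the gate."""
    params = {**DEFAULT_PARAMS, **overrides}      # pipeline parameters per run
    accuracy = evaluate(params)                   # stands in for the evaluation step
    if accuracy > params["accuracy_threshold"]:   # ConditionStep branch
        deploy(params)                            # register + deploy branch
        return "deployed"
    return "rejected"                             # model fails the quality gate
```

A run with 95% accuracy takes the deploy branch; one with 85% is rejected, which is exactly the branching a `ConditionStep` encodes declaratively.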
💡 First Principle: SageMaker Pipelines is the ML-native orchestration service—it understands ML concepts (training jobs, model registration, endpoint creation) natively, unlike general-purpose orchestrators that treat these as generic compute steps. This ML-awareness provides built-in caching, lineage tracking, and parameter management—but it's SageMaker-specific. For broader orchestration needs, consider Step Functions or Apache Airflow (MWAA).
| Orchestrator | ML-Native | Scope | Best For |
|---|---|---|---|
| SageMaker Pipelines | Yes | SageMaker-centric workflows | End-to-end ML: data prep → train → evaluate → register → deploy |
| AWS Step Functions | No | General AWS service orchestration | Workflows combining ML and non-ML services |
| Amazon MWAA (Airflow) | No | General orchestration | Teams using Airflow, complex DAGs, non-AWS integrations |
| Amazon EventBridge | No | Event-driven triggers | Triggering pipelines on schedule or event (new S3 data, model drift alert) |
SageMaker Pipelines features:
- Step caching: If input data and parameters haven't changed, skip re-running a step
- Pipeline parameters: Configurable values (instance type, data path) that vary per run
- Conditional steps: Branch logic based on model metrics (deploy only if accuracy > threshold)
- Lineage tracking: Automatic tracking of which data and parameters produced which model
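Step caching can be modeled as a content-addressed lookup: hash the step's inputs and parameters, and reuse the stored output when the hash matches a previous run. A minimal sketch of that mechanic (illustrative only; the actual service tracks cache hits per step with a configurable expiry period):

```python
import hashlib
import json

_cache = {}  # cache key -> output of a previous step execution

def run_step(name, inputs, params, execute):
    """Re-run `execute` only when inputs or params changed since the last run."""
    # Build a deterministic cache key from the step's inputs and parameters.
    key_material = json.dumps(
        {"step": name, "inputs": inputs, "params": params}, sort_keys=True
    )
    key = hashlib.sha256(key_material.encode()).hexdigest()
    if key in _cache:
        return _cache[key], "cache-hit"   # skip re-execution entirely
    output = execute(inputs, params)      # e.g. an hours-long training job
    _cache[key] = output
    return output, "executed"
```

An unchanged re-run returns the cached output instantly; changing any hyperparameter produces a new key and forces re-execution, which is why caching saves so much time during iterative development.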
EventBridge integration is critical for triggering retraining. Common patterns: an S3 bucket publishes object-created events to EventBridge, and a rule triggers a SageMaker Pipeline when new training data arrives; another rule listens for Model Monitor drift alerts and triggers retraining when drift exceeds a threshold.
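The S3-to-retraining pattern can be sketched as a Lambda target for the EventBridge rule that starts a pipeline execution via the `start_pipeline_execution` API, passing the new object's location as a pipeline parameter. The pipeline name and parameter name below are assumptions, and the SageMaker client is injected as an argument so the handler can be exercised without AWS credentials:

```python
def handler(event, sagemaker_client, pipeline_name="retrain-pipeline"):
    """EventBridge (S3 'Object Created') target: kick off retraining."""
    # S3 events delivered via EventBridge carry bucket/object under "detail".
    detail = event["detail"]
    s3_uri = f"s3://{detail['bucket']['name']}/{detail['object']['key']}"
    resp = sagemaker_client.start_pipeline_execution(
        PipelineName=pipeline_name,          # assumed pipeline name
        PipelineParameters=[                 # maps to a pipeline parameter
            {"Name": "InputDataS3Uri", "Value": s3_uri},  # assumed parameter name
        ],
    )
    return resp["PipelineExecutionArn"]
```

In production the injected client would be `boto3.client("sagemaker")`; the same handler shape works for a drift-alert rule, just with a different event source and detail payload.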
⚠️ Exam Trap: SageMaker Pipelines and AWS CodePipeline are different services. SageMaker Pipelines orchestrates ML workflow steps within SageMaker (training, processing, evaluation). CodePipeline orchestrates software delivery (build, test, deploy). For a complete MLOps setup, you often use both: SageMaker Pipelines for the ML workflow and CodePipeline for the CI/CD wrapper that triggers it.
Reflection Question: A company wants to automatically retrain their model when new data lands in S3, evaluate it, and deploy it only if accuracy exceeds 90%. Which combination of AWS services implements this?