2.6.1. AWS Step Functions and State Machines
First Principle: Step Functions models your pipeline as a state machine: a series of states connected by transitions, where each state can invoke an AWS service, make a decision, run tasks in parallel, or handle errors. It is the AWS-native answer to "I need my pipeline steps to run in a specific order with error handling," without writing any orchestration code yourself.
A Step Functions workflow (state machine) is defined in Amazon States Language (JSON). Key state types:
Task invokes an AWS service: start a Glue job, invoke a Lambda function, run an ECS task, query DynamoDB, or call any of the 200+ AWS services reachable through AWS SDK integrations. Direct service integrations are preferred over Lambda wrappers because they reduce cost and complexity.
Choice adds conditional branching, for example "if the Glue job returned 0 records, skip the load step." This enables pipelines that adapt to data conditions.
Parallel runs multiple branches simultaneously, which is useful for processing independent datasets concurrently and then merging the results.
Map iterates over a collection: process each file in a list, each partition in a dataset, or each record in an array. Distributed Map mode can process millions of items by fanning out to concurrent child executions.
Wait introduces delays, which is useful for polling external systems or rate-limiting calls to downstream services.
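To make the state types above concrete, here is a minimal sketch that builds an Amazon States Language definition as a Python dictionary and serializes it to JSON. Every job name, function name, and state name is hypothetical, and the sketch assumes the execution input contains a `files` array and that the `example-count-records` function returns a `recordCount` field. It is an illustration of the shape of each state, not a production pipeline.

```python
import json


def lambda_task(function_name, next_state=None):
    """Build a Task state that invokes a (hypothetical) Lambda function."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": {"FunctionName": function_name, "Payload.$": "$"},
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


definition = {
    "StartAt": "RunGlueJob",
    "States": {
        # Task: direct Glue integration; ".sync" waits for the job to finish.
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "example-clean-job"},
            "ResultPath": "$.glue",
            "Next": "CountRecords",
        },
        # Assumption: this Lambda returns {"recordCount": <int>}.
        "CountRecords": {
            **lambda_task("example-count-records", "AnyRecords"),
            "ResultSelector": {"recordCount.$": "$.Payload.recordCount"},
            "ResultPath": "$.stats",
        },
        # Choice: skip the remaining steps when there are no records.
        "AnyRecords": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.stats.recordCount",
                    "NumericEquals": 0,
                    "Next": "NothingToLoad",
                }
            ],
            "Default": "ProcessDatasets",
        },
        "NothingToLoad": {"Type": "Succeed"},
        # Parallel: two independent branches run at the same time.
        "ProcessDatasets": {
            "Type": "Parallel",
            "Branches": [
                {"StartAt": "Orders",
                 "States": {"Orders": lambda_task("example-orders")}},
                {"StartAt": "Customers",
                 "States": {"Customers": lambda_task("example-customers")}},
            ],
            "ResultPath": "$.parallel",
            "Next": "ProcessFiles",
        },
        # Map: run the same sub-workflow once per element of $.files.
        "ProcessFiles": {
            "Type": "Map",
            "ItemsPath": "$.files",
            "MaxConcurrency": 10,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "LoadFile",
                "States": {"LoadFile": lambda_task("example-load-file")},
            },
            "ResultPath": "$.map",
            "Next": "Pause",
        },
        # Wait: fixed delay before the final step.
        "Pause": {"Type": "Wait", "Seconds": 30, "Next": "Finish"},
        "Finish": lambda_task("example-finish"),
    },
}

# The JSON string is what would be supplied as the state machine definition.
asl_json = json.dumps(definition, indent=2)

for name, state in definition["States"].items():
    print(f"{name}: {state['Type']}")
```

Only `RunGlueJob` uses a direct service integration here; the Lambda-backed states stand in for custom logic that has no direct integration.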
There are two execution models. Standard workflows support long-running executions (up to 1 year), are billed per state transition, and support all state types. Express workflows support high-volume, short-duration executions (up to 5 minutes), are billed by the number of executions plus their duration and memory use, and are suited to event processing. For data pipelines, Standard is almost always the right choice.
⚠️ Exam Trap: Step Functions charges per state transition for Standard workflows and per execution plus duration for Express workflows. In a Standard workflow, a Map state with 1,000 iterations incurs at least 1,000 transitions. For high-volume iteration, consider Distributed Map mode, which can group items into batches so that fewer child executions run, or an alternative such as Lambda with SQS. The exam may present a cost-optimization question where the answer is to restructure the state machine to reduce transitions.
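The effect of batching on transition count can be sketched with some rough arithmetic. The model below is a simplification, not the actual billing formula: it assumes each iteration runs a fixed number of states, that a batched iteration handles its whole batch in the same number of states, and that the price is USD 0.025 per 1,000 transitions (an assumed figure; pricing varies by Region and over time).

```python
import math

# Assumed Standard-workflow price per state transition (check current pricing).
PRICE_PER_TRANSITION = 0.025 / 1000


def estimate_transitions(items, states_per_iteration, batch_size=1, overhead=2):
    """Rough transition count for a Map over `items` items.

    `overhead` is an assumed allowance for the states outside the Map.
    """
    iterations = math.ceil(items / batch_size)
    return iterations * states_per_iteration + overhead


def estimate_cost(transitions):
    """Approximate cost in USD under the assumed price."""
    return transitions * PRICE_PER_TRANSITION


per_item = estimate_transitions(1000, states_per_iteration=3)
batched = estimate_transitions(1000, states_per_iteration=3, batch_size=100)

print(per_item, batched)  # 3002 32
print(f"per-item: ${estimate_cost(per_item):.5f}, batched: ${estimate_cost(batched):.5f}")
```

Under these assumptions, batching 100 items per iteration cuts the transition count by roughly two orders of magnitude, which is the kind of restructuring a cost question is likely looking for.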
Reflection Question: A pipeline has 5 sequential steps: extract from RDS, clean with Lambda, convert to Parquet with Glue, load into Redshift, then send a notification via SNS. If any step fails, the pipeline should retry twice then alert the team. Which Step Functions features handle this?
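One possible answer can be sketched in code: a chain of Task states, each with a Retry block (`MaxAttempts` of 2) and a Catch block that routes to an SNS alert. The function names, job name, cluster details, and topic ARN below are all hypothetical, and the sketch assumes the RDS extract is done by a Lambda function, since the question does not say how. Note that the Redshift Data API call returns as soon as the statement is submitted, so a real pipeline would likely need to poll for completion.

```python
import json

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:example-pipeline-alerts"  # hypothetical

# Retry any error twice, with exponential backoff, before giving up.
RETRY = [{"ErrorEquals": ["States.ALL"], "IntervalSeconds": 30,
          "MaxAttempts": 2, "BackoffRate": 2.0}]
# After retries are exhausted, record the error and go to the alert state.
CATCH = [{"ErrorEquals": ["States.ALL"], "ResultPath": "$.error",
          "Next": "AlertTeam"}]


def step(resource, parameters, next_state=None):
    """Build a Task state with the shared retry and catch policy."""
    state = {
        "Type": "Task",
        "Resource": resource,
        "Parameters": parameters,
        "ResultPath": None,  # serialized as null: discard result, keep input
        "Retry": RETRY,
        "Catch": CATCH,
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


LAMBDA = "arn:aws:states:::lambda:invoke"

pipeline = {
    "StartAt": "ExtractFromRds",
    "States": {
        "ExtractFromRds": step(
            LAMBDA, {"FunctionName": "example-extract-rds", "Payload.$": "$"},
            "CleanData"),
        "CleanData": step(
            LAMBDA, {"FunctionName": "example-clean", "Payload.$": "$"},
            "ConvertToParquet"),
        "ConvertToParquet": step(
            "arn:aws:states:::glue:startJobRun.sync",
            {"JobName": "example-to-parquet"}, "LoadIntoRedshift"),
        # Submits the statement only; completion would need separate polling.
        "LoadIntoRedshift": step(
            "arn:aws:states:::aws-sdk:redshiftdata:executeStatement",
            {"ClusterIdentifier": "example-cluster", "Database": "dev",
             "DbUser": "example_user", "Sql": "CALL example_load();"},
            "NotifySuccess"),
        "NotifySuccess": step(
            "arn:aws:states:::sns:publish",
            {"TopicArn": TOPIC_ARN, "Message": "Pipeline succeeded"}),
        # Reached only through a Catch; publishes the error cause.
        "AlertTeam": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {"TopicArn": TOPIC_ARN, "Message.$": "$.error.Cause"},
            "Next": "PipelineFailed",
        },
        "PipelineFailed": {"Type": "Fail", "Error": "PipelineError"},
    },
}

pipeline_json = json.dumps(pipeline, indent=2)
print(pipeline_json)
```

Ending the failure path in a Fail state keeps the execution marked as failed after the alert is sent, which makes failures visible in the execution history as well as in the notification.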