2.4.1. AWS Glue ETL: DynamicFrames, Spark, and Job Bookmarks
💡 First Principle: Glue ETL is the serverless Spark engine optimized for data lake transformations. It runs Apache Spark under the hood but adds features that Spark alone doesn't provide: DynamicFrames for schema flexibility, job bookmarks for incremental processing, and native integration with the Glue Data Catalog. When the exam says "serverless ETL" or "least operational overhead for transformation," Glue is almost always the answer.
DynamicFrames vs DataFrames. Standard Spark DataFrames require a consistent schema: every record must have the same columns and types. Glue's DynamicFrames handle schema inconsistency gracefully: records can have different columns, and the resolveChoice method lets you decide how to handle type conflicts (cast, project, or make_struct). This is critical for data lake workloads where source schemas drift.
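The `awsglue` runtime isn't available outside a Glue job, so here is a plain-Python sketch of what a `resolveChoice` "cast" resolution does conceptually. The function name and null-on-failure behavior are illustrative, not the actual awsglue API:

```python
# Plain-Python sketch of resolveChoice(specs=[("price", "cast:double")]):
# records arrive with inconsistent types for the same field, and the
# conflict is resolved by casting. Unconvertible values become null.

def resolve_choice_cast(records, field, target=float):
    """Cast `field` to `target` in every record; null out values that
    cannot be converted (illustrative helper, not awsglue)."""
    resolved = []
    for rec in records:
        rec = dict(rec)
        if field in rec:
            try:
                rec[field] = target(rec[field])
            except (TypeError, ValueError):
                rec[field] = None  # unconvertible value becomes null
        resolved.append(rec)
    return resolved

# Schema drift: `price` shows up as an int, a numeric string, and garbage.
raw = [{"id": 1, "price": 10},
       {"id": 2, "price": "12.5"},
       {"id": 3, "price": "n/a"}]
clean = resolve_choice_cast(raw, "price")
```

In a real Glue script the same decision is a one-liner on the DynamicFrame; the point is that you choose a resolution policy instead of failing the job on the first mismatched record.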
Job bookmarks. Glue can track which data has already been processed. On subsequent runs, only new or modified data is processed, enabling incremental ETL. Bookmarks work with S3 sources (tracking objects by last-modified timestamp) and JDBC sources (tracking new rows via a monotonically increasing bookmark key column); streaming jobs reading from Kafka or Kinesis track their position with checkpoints rather than bookmarks. This eliminates reprocessing and dramatically reduces cost for pipelines that run frequently.
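The mechanics can be sketched in plain Python: persist a high-water mark (the newest modification time processed so far) and, on each run, process only objects newer than it. The state file and function names are illustrative, not Glue internals:

```python
# Conceptual sketch of S3-style job bookmarks: keep a watermark of the
# last processed modification time and skip anything at or below it.
import json
import os

STATE_FILE = "bookmark.json"  # illustrative; Glue stores this internally


def load_bookmark():
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)["last_modified"]
    return 0  # first run: process everything


def incremental_run(objects):
    """objects: list of (key, last_modified_epoch) pairs. Returns the keys
    to process and advances the watermark, loosely like job.commit()."""
    watermark = load_bookmark()
    new = [(k, m) for k, m in objects if m > watermark]
    if new:
        with open(STATE_FILE, "w") as f:
            json.dump({"last_modified": max(m for _, m in new)}, f)
    return [k for k, _ in new]
```

First run returns every key; a second run over the same listing plus one new object returns only the new object, which is exactly the incremental behavior bookmarks give an hourly job.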
Glue job types for the exam:
| Job Type | Engine | Use Case | DPU |
|---|---|---|---|
| Spark | Apache Spark | Heavy ETL, joins, aggregations, format conversion | 2–100 DPUs |
| Spark Streaming | Spark Structured Streaming | Near-real-time from Kinesis/Kafka | 2–100 DPUs |
| Python Shell | Python | Lightweight scripts, API calls, simple transforms | 0.0625 or 1 DPU |
| Ray | Ray | Distributed Python, ML feature engineering | 2+ DPUs |
Glue Studio provides a visual ETL designer: drag-and-drop sources, transforms, and targets. It generates PySpark code that you can customize. For the exam, Glue Studio is the signal when questions mention "visual pipeline design" or "no-code ETL."
Cost optimization. Glue charges per DPU-second, with a 1-minute minimum billing duration on recent Glue versions. Key optimizations: use Flex execution (preemptible capacity, up to ~34% cheaper) for non-urgent jobs, enable auto-scaling to avoid over-provisioning, use column projection and predicate pushdown to read only the necessary data, and convert CSV to Parquet early in the pipeline to reduce downstream scan costs.
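The Flex savings are easy to sanity-check with back-of-envelope arithmetic. The rates below are assumptions (roughly us-east-1 list prices at time of writing); check the current AWS Glue pricing page before relying on them:

```python
# Back-of-envelope DPU cost comparison: standard vs Flex execution.
# Rates are illustrative assumptions, not authoritative pricing.

STANDARD_PER_DPU_HOUR = 0.44  # assumed standard Spark job rate
FLEX_PER_DPU_HOUR = 0.29      # assumed Flex rate


def job_cost(dpus, minutes, rate_per_dpu_hour):
    """Glue bills per DPU-second with a minimum; this ignores the minimum."""
    return dpus * (minutes / 60) * rate_per_dpu_hour


standard = job_cost(10, 30, STANDARD_PER_DPU_HOUR)  # 10 DPUs for 30 min
flex = job_cost(10, 30, FLEX_PER_DPU_HOUR)
savings = 1 - flex / standard  # ~0.34, matching the ~34% figure above
```

At these assumed rates a 10-DPU, 30-minute hourly job costs about $2.20 per run on standard capacity versus about $1.45 on Flex, which compounds quickly across hundreds of runs per month.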
⚠️ Exam Trap: Glue job bookmarks only work if the data source supports them and the job is wired up correctly. For S3, bookmarks track objects by last-modified timestamp, and each read needs a transformation_ctx plus a job.commit() call at the end of the script; omit either and nothing is tracked. For JDBC, you must specify a bookmark key column whose values increase (or decrease) monotonically. If a question mentions "reprocessing data that was already processed," check whether job bookmarks were properly configured.
Reflection Question: A Glue job reads CSV files from S3, converts them to Parquet, and writes to an output bucket. The job runs hourly but reprocesses all historical files every time. What Glue feature would fix this?