Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.4.1. AWS Glue ETL: DynamicFrames, Spark, and Job Bookmarks

šŸ’” First Principle: Glue ETL is the serverless Spark engine optimized for data lake transformations. It runs Apache Spark under the hood but adds features that Spark alone doesn't provide: DynamicFrames for schema flexibility, job bookmarks for incremental processing, and native integration with the Glue Data Catalog. When the exam says "serverless ETL" or "least operational overhead for transformation," Glue is almost always the answer.

DynamicFrames vs DataFrames. Standard Spark DataFrames require a consistent schema — every record must have the same columns and types. Glue's DynamicFrames handle schema inconsistency gracefully: records can have different columns, and the resolveChoice method lets you decide how to handle type conflicts (cast, project, or make_struct). This is critical for data lake workloads where source schemas drift.
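To make the idea concrete, here is a toy pure-Python sketch of what cast resolution does conceptually. This is not the Glue API itself (in a real job you would call something like `dyf.resolveChoice(specs=[("price", "cast:double")])` on a DynamicFrame); the function and sample records below are illustrative assumptions.

```python
# Toy illustration of schema-drift handling: the same field arrives with
# inconsistent types across records, and we resolve the conflict by casting,
# which is what resolveChoice's "cast" action does at scale in Glue.
def resolve_cast(records, field, target=float):
    """Cast `field` to `target` in every record; null it out if it can't cast."""
    resolved = []
    for rec in records:
        rec = dict(rec)  # don't mutate the caller's records
        if field in rec:
            try:
                rec[field] = target(rec[field])
            except (TypeError, ValueError):
                rec[field] = None  # unresolvable value
        resolved.append(rec)
    return resolved

# Source schema drifted: price arrives as int, string, and float.
raw = [
    {"sku": "a", "price": 10},
    {"sku": "b", "price": "12.5"},
    {"sku": "c", "price": 9.99},
]
clean = resolve_cast(raw, "price")
# Every record now carries price as a float: 10.0, 12.5, 9.99
```

A strict Spark DataFrame read of the same data would either fail or silently coerce the column to string; the DynamicFrame approach defers the decision until you explicitly resolve it.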

Job bookmarks. Glue can track which data has already been processed. On subsequent runs, only new or modified data is processed — enabling incremental ETL. Bookmarks work with S3 (tracking new files by timestamp), JDBC sources (tracking new rows by a monotonically increasing column), and Kafka topics (tracking offsets). This eliminates reprocessing and dramatically reduces cost for pipelines that run frequently.
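The bookmark idea can be sketched as a high-water mark over an S3-style file listing. Glue manages this automatically when bookmarks are enabled and each source node carries a `transformation_ctx`; the plain-Python function below is only a conceptual model, and its names are assumptions.

```python
# Minimal sketch of incremental processing with a bookmark: remember the
# newest modification timestamp seen so far, and on each run process only
# files newer than that mark.
def incremental_run(files, bookmark):
    """files: list of (key, modified_ts); bookmark: timestamp of last run (0 = never)."""
    new = [(key, ts) for key, ts in files if ts > bookmark]
    processed = [key for key, _ in new]
    new_bookmark = max([ts for _, ts in files], default=bookmark)
    return processed, new_bookmark

listing = [("2024/01/a.csv", 100), ("2024/01/b.csv", 200)]
run1, bm = incremental_run(listing, 0)    # first run: processes both files
listing.append(("2024/01/c.csv", 300))    # a new file lands before the next run
run2, bm = incremental_run(listing, bm)   # second run: processes only the new file
```

Note the failure mode this model exposes: if a file reappears without advancing past the stored mark, it is silently skipped, which is exactly the reprocessing trap called out later in this section.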

Glue job types for the exam:
| Job Type | Engine | Use Case | DPU Range |
| --- | --- | --- | --- |
| Spark | Apache Spark | Heavy ETL, joins, aggregations, format conversion | 2–100 DPUs |
| Spark Streaming | Spark Structured Streaming | Near-real-time from Kinesis/Kafka | 2–100 DPUs |
| Python Shell | Python | Lightweight scripts, API calls, simple transforms | 0.0625–1 DPU |
| Ray | Ray | Distributed Python, ML feature engineering | 2+ DPUs |

Glue Studio provides a visual ETL designer — drag-and-drop sources, transforms, and targets. It generates PySpark code that you can customize. For the exam, Glue Studio is the signal when questions mention "visual pipeline design" or "no-code ETL."

Cost optimization. Glue bills per DPU-second. Key optimizations: use Flex execution (preemptible capacity, roughly 34% cheaper) for non-urgent jobs; enable auto-scaling so jobs don't hold idle workers; use column projection and predicate pushdown to read only the data you need; and convert CSV to Parquet early in the pipeline to cut downstream scan costs.
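The Flex savings are easy to verify with back-of-envelope arithmetic. The rates below are illustrative assumptions (Glue pricing varies by region, and jobs have per-run minimum billing), but the ratio shows where the "~34% cheaper" figure comes from.

```python
# Back-of-envelope DPU cost comparison. Rates are ASSUMED for illustration:
# standard ~$0.44 per DPU-hour vs Flex ~$0.29 per DPU-hour, billed per second.
STANDARD_RATE = 0.44  # USD per DPU-hour (assumed)
FLEX_RATE = 0.29      # USD per DPU-hour (assumed)

def job_cost(dpus, minutes, rate_per_dpu_hour):
    """Cost of running `dpus` workers for `minutes` at the given hourly rate."""
    return dpus * (minutes / 60) * rate_per_dpu_hour

standard = job_cost(10, 30, STANDARD_RATE)  # 10 DPUs for 30 min -> $2.20
flex = job_cost(10, 30, FLEX_RATE)          # same job on Flex   -> $1.45
savings = 1 - flex / standard               # ~0.34, i.e. ~34% cheaper
```

The other levers (predicate pushdown, Parquet conversion) attack the `minutes` term in the same formula by shrinking the data each run actually reads.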

āš ļø Exam Trap: Glue job bookmarks only work if the data source supports them. For S3, bookmarks track files — if you overwrite a file with the same name, the bookmark won't detect it as new. For JDBC, you must specify a bookmark key column. If a question mentions "reprocessing data that was already processed," check whether job bookmarks were properly configured.

Reflection Question: A Glue job reads CSV files from S3, converts them to Parquet, and writes to an output bucket. The job runs hourly but reprocesses all historical files every time. What Glue feature would fix this?

Written by Alvin Varughese (Founder, 15 professional certifications)