Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.2.2. AWS Glue Crawlers and Batch Jobs

💡 First Principle: AWS Glue crawlers solve the "what's in my data lake?" problem automatically. Instead of manually defining table schemas for every dataset that lands in S3, a crawler examines the data, infers its schema (column names, data types, partitions), and registers it in the Glue Data Catalog — making the data instantly queryable by Athena, Redshift Spectrum, and EMR.

Crawlers are schema discovery agents. You point a crawler at an S3 path (or JDBC-accessible database), and it samples the data, detects the format (CSV, Parquet, JSON, etc.), infers column types, identifies partition structures (e.g., year=2024/month=03/), and creates or updates catalog tables. This eliminates one of the most tedious tasks in data lake management.
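To make "schema inference" concrete, here is a simplified pure-Python sketch of two things a crawler works out for you: column typing from sampled rows, and Hive-style partition detection from the S3 key layout. The function names and type-narrowing rules are illustrative, not Glue's actual classifier logic.

```python
import csv
import io
import re

def infer_type(values):
    """Infer the narrowest SQL-ish type that fits every sampled value."""
    for caster, type_name in ((int, "bigint"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            pass
    return "string"

def infer_schema(csv_text):
    """Infer column names and types from a CSV sample, as a crawler would."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    return {col: infer_type([r[i] for r in data]) for i, col in enumerate(header)}

def partition_keys(s3_key):
    """Extract Hive-style partition keys (e.g. year=2024/month=03/) from a key."""
    return dict(re.findall(r"([^/=]+)=([^/]+)", s3_key))

sample = "order_id,amount,status\n1001,19.99,shipped\n1002,5,returned\n"
print(infer_schema(sample))   # {'order_id': 'bigint', 'amount': 'double', 'status': 'string'}
print(partition_keys("sales/year=2024/month=03/part-000.csv"))  # {'year': '2024', 'month': '03'}
```

Note how `amount` is promoted to `double` because one sampled value is a decimal: real crawlers make the same kind of widening decision, which is why sampling depth can affect the inferred schema.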

Crawlers for batch ingestion. In a typical pipeline: raw data lands in S3, a crawler runs to catalog the new data, and then a Glue ETL job (or Athena query) processes it using the catalog table. Crawlers can run on a schedule, on-demand, or triggered by EventBridge events.
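As a sketch of how the cataloging step of such a pipeline is wired up, the payload below is the kind of request you would pass to the Glue API's `create_crawler` call (e.g. via `boto3`) for a scheduled daily crawl. The crawler name, role ARN, database, and S3 path are placeholders:

```python
# Illustrative create_crawler request payload; names, ARNs, and paths are
# placeholders. In practice this dict would be passed as keyword arguments
# to a boto3 Glue client: glue.create_crawler(**crawler_request).
crawler_request = {
    "Name": "raw-sales-crawler",
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "raw_sales",
    "Targets": {"S3Targets": [{"Path": "s3://example-lake/raw/sales/"}]},
    "Schedule": "cron(0 2 * * ? *)",  # daily at 02:00 UTC, after raw data lands
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # pick up new or changed columns
        "DeleteBehavior": "LOG",                 # don't drop tables when data is missing
    },
}
print(crawler_request["Schedule"])
```

For event-driven pipelines you would drop `Schedule` and instead start the crawler from an EventBridge rule that fires when new objects arrive in the raw prefix.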

Glue batch ETL jobs. Glue ETL jobs transform data at scale using Apache Spark. You write scripts in Python (PySpark) or Scala, and Glue manages the Spark cluster. Key features for the exam: job bookmarks track processed data so subsequent runs only process new records (incremental ETL), DynamicFrames provide a Glue-native abstraction over Spark DataFrames with built-in schema flexibility (resolveChoice for ambiguous types), and Glue Studio offers a visual ETL designer for no-code pipeline building.
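Glue manages bookmark state internally, so you never write this logic yourself; the pure-Python sketch below only illustrates the contract (process records newer than the stored bookmark, then advance it), plus a cast that mirrors what a DynamicFrame `resolveChoice` with `cast:double` would do for a column that arrives with mixed types. All field names here are hypothetical:

```python
# Hypothetical sketch of the job-bookmark contract. Glue persists the
# bookmark for you between runs; a plain dict stands in for that state.
def run_job(records, bookmark):
    """Process only records newer than the bookmark, then advance it."""
    new = [r for r in records if r["ts"] > bookmark.get("last_ts", "")]
    # Mirror resolveChoice cast:double -- normalize an ambiguous column to float.
    processed = [{**r, "amount": float(r["amount"])} for r in new]
    if new:
        bookmark["last_ts"] = max(r["ts"] for r in new)
    return processed, bookmark

batch = [{"ts": "2024-03-01", "amount": "19.99"},
         {"ts": "2024-03-02", "amount": "5"}]
bookmark = {}
out1, bookmark = run_job(batch, bookmark)   # first run: both records processed
out2, bookmark = run_job(batch, bookmark)   # rerun on same data: nothing new
print(len(out1), len(out2), bookmark["last_ts"])  # 2 0 2024-03-02
```

The second run returning nothing is exactly the exam-relevant behavior: with bookmarks enabled (`--job-bookmark-option job-bookmark-enable`), rerunning a Glue job does not reprocess data it has already seen.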

Glue job types: Spark (for heavy transformations), Python Shell (for lightweight scripting without Spark overhead), and Ray (for distributed Python workloads). The exam typically tests Spark jobs for ETL and Python Shell for simple orchestration or API calls.
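In the Glue API, the job type is simply the `Command` name passed to `create_job`. The two illustrative payloads below contrast a Spark ETL job with bookmarks enabled against a fractional-DPU Python Shell job; job names, script paths, and the role ARN are placeholders:

```python
# Illustrative create_job request payloads; in practice each dict would be
# passed to a boto3 Glue client as glue.create_job(**payload).
spark_job = {
    "Name": "nightly-sales-etl",  # placeholder name
    "Role": "arn:aws:iam::123456789012:role/GlueJobRole",
    "Command": {
        "Name": "glueetl",  # Spark job type
        "ScriptLocation": "s3://example-bucket/scripts/etl.py",
        "PythonVersion": "3",
    },
    "GlueVersion": "4.0",
    "DefaultArguments": {"--job-bookmark-option": "job-bookmark-enable"},
}

shell_job = {
    "Name": "notify-on-completion",  # placeholder name
    "Role": spark_job["Role"],
    "Command": {
        "Name": "pythonshell",  # lightweight job type, no Spark cluster
        "ScriptLocation": "s3://example-bucket/scripts/notify.py",
        "PythonVersion": "3.9",
    },
    "MaxCapacity": 0.0625,  # a fraction of one DPU is enough for scripting
}
print(spark_job["Command"]["Name"], shell_job["Command"]["Name"])
```

The capacity difference is the practical takeaway: a Python Shell job can run on as little as 1/16 of a DPU, which is why it is the right answer for simple orchestration or API-call tasks where spinning up Spark would be wasteful.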

āš ļø Exam Trap: Glue crawlers don't transform data — they only discover and catalog schema. If a question asks about "transforming data and cataloging it," that requires both a Glue ETL job (transform) and a crawler (catalog). Don't confuse the two.

Reflection Question: Your data lake receives daily CSV files with slightly different column names across sources. How do Glue crawlers and DynamicFrames each address schema inconsistency?

Written by Alvin Varughese
Founder • 15 professional certifications