3.3.1. AWS Glue Data Catalog and Crawlers
💡 First Principle: The Glue Data Catalog is the schema registry for your data lake. Query engines (Athena, Redshift Spectrum, EMR) don't discover schemas from S3 files on their own; they query tables in the Glue Data Catalog that describe where the files are, what columns they contain, and how they're partitioned. The catalog is the contract between your data files and your query tools.
The Glue Data Catalog organizes metadata hierarchically: databases contain tables, and tables describe the location, format, schema (columns and types), and partition structure of data in S3 or JDBC sources.
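That hierarchy is visible in the shape of a table definition. A minimal sketch, assuming a hypothetical `orders` table and bucket, of the `TableInput` payload you would pass to the Glue `CreateTable` API (e.g. `boto3.client("glue").create_table(DatabaseName="sales_db", TableInput=orders)`):

```python
# Sketch of a Glue Data Catalog table definition (TableInput) for a
# Parquet dataset in S3. Table name, database, and S3 path are illustrative.

def parquet_table_input(name, location, columns, partition_keys):
    """Build a TableInput dict covering the four things a catalog table
    describes: location, format, schema (columns/types), partition keys."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        # Partition columns are declared separately from data columns.
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": {
            "Location": location,  # where the files are in S3
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            # Hive classes identifying the Parquet format for query engines:
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }

orders = parquet_table_input(
    name="orders",
    location="s3://example-lake/orders/",
    columns=[("order_id", "string"), ("amount", "double")],
    partition_keys=[("year", "string"), ("month", "string")],
)
```

Athena and Redshift Spectrum resolve a query against exactly this metadata: the `Location` tells them which S3 prefix to scan, the `SerdeInfo` tells them how to deserialize each file, and the `PartitionKeys` enable partition pruning.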
Glue crawlers automate catalog management. A crawler scans a data source (S3 path, JDBC database), samples the data to infer schema, and creates or updates tables in the catalog. Key exam concepts:
Schema inference. Crawlers detect column names, data types, and delimiters automatically. For S3, they identify the file format (CSV, Parquet, JSON, ORC) and sample records to determine types.
Partition discovery. When S3 data follows a Hive-style partition structure (year=2025/month=03/), crawlers automatically create partition entries in the catalog. New partitions are added on subsequent crawl runs.
Partition synchronization. For append-only data lakes where new partitions appear frequently, MSCK REPAIR TABLE (in Athena) or the Glue BatchCreatePartition API can register new partitions without running a full crawl, which is faster and cheaper.
Connections. Crawlers can access JDBC databases (RDS, Aurora, Redshift), MongoDB, and other sources through Glue connections that define the endpoint, credentials, and VPC configuration.
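Putting the crawler concepts together, here is a sketch of the configuration you would pass to the Glue `CreateCrawler` API (via `boto3.client("glue").create_crawler(**crawler_cfg)`). The crawler name, IAM role ARN, database, and S3 path are hypothetical:

```python
# Sketch of a CreateCrawler request: an S3-targeted crawler that infers
# schema, discovers Hive-style partitions, and runs on a daily schedule.
# All names and ARNs below are illustrative.

crawler_cfg = {
    "Name": "orders-crawler",
    # Role needs S3 read access on the target path plus Glue catalog write access.
    "Role": "arn:aws:iam::123456789012:role/GlueCrawlerRole",
    "DatabaseName": "sales_db",
    # Targets may also include JdbcTargets pointing at a Glue connection.
    "Targets": {"S3Targets": [{"Path": "s3://example-lake/orders/"}]},
    # Schedule is optional; omit it to run the crawler on demand instead.
    "Schedule": "cron(0 2 * * ? *)",  # daily at 02:00 UTC
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # update tables on schema drift
        "DeleteBehavior": "LOG",  # log, rather than drop, tables whose data vanished
    },
}
```

The `SchemaChangePolicy` is a recurring exam detail: it controls whether the crawler rewrites catalog entries when it detects schema drift, and whether it deletes or merely logs tables whose underlying data has disappeared.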
⚠️ Exam Trap: Running crawlers on a schedule is simple but expensive for high-frequency updates. If partitions change hourly, running a crawler hourly wastes DPU-hours scanning unchanged data. Alternatives: use MSCK REPAIR TABLE, call BatchCreatePartition API directly, or use EventBridge to trigger partition registration only when new data arrives.
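The event-driven alternative looks like this in practice: a Lambda triggered by EventBridge builds a `PartitionInput` for the newly arrived prefix and calls `boto3.client("glue").batch_create_partition(DatabaseName=db, TableName=table, PartitionInputList=[p])`, with no crawler involved. A minimal sketch of building that payload, assuming a hypothetical Parquet table partitioned by year/month/day:

```python
# Sketch: register one new Hive-style partition directly in the catalog,
# skipping the crawler entirely. Bucket, table layout, and partition
# columns below are illustrative assumptions.

def daily_partition_input(table_location, year, month, day):
    """Build a PartitionInput for a year=/month=/day= partition."""
    return {
        # Values must be in the same order as the table's PartitionKeys.
        "Values": [year, month, day],
        "StorageDescriptor": {
            # The new partition's files live under the Hive-style prefix.
            "Location": f"{table_location}year={year}/month={month}/day={day}/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            },
        },
    }

p = daily_partition_input("s3://example-lake/orders/", "2025", "03", "15")
```

In production code you would typically fetch the table with `get_table` and copy its `StorageDescriptor`, overriding only `Location`, so the partition stays consistent with the table definition. Registering a partition this way costs one API call instead of a crawl's DPU-hours.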
Reflection Question: Your data lake adds a new daily partition to 200 tables every night. Running 200 crawlers is expensive. What's a more efficient approach to keep the Glue Data Catalog in sync?