2.5.2. Format Selection and Conversion Patterns
💡 First Principle: Format conversion should happen as early in the pipeline as possible: the earlier you convert to an efficient format, the more downstream processes benefit. The raw landing zone stores data in whatever format it arrives in (CSV, JSON), but the curated zone should always use a columnar, compressed format.
The common data lake pattern uses three zones:
Raw zone: Data lands in its original format. No transformation. This preserves the original data for reprocessing and auditing.
Curated zone: Data is cleaned, typed, and converted to Parquet or ORC with compression. Partitioned by commonly filtered columns (date, region, product category). This is where Glue ETL or EMR Spark jobs do the heavy lifting.
Serving zone: Data is further aggregated or denormalized for specific consumption patterns: dashboards, ML feature stores, or API responses. May live in Redshift, DynamoDB, or a dedicated S3 prefix.
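To make the zone layout concrete, here is a minimal sketch of how a curated-zone object key might be built. The bucket prefixes, dataset name, and key layout are illustrative assumptions, not a prescribed convention; the point is the Hive-style `date=` partition segment that query engines use for pruning.

```python
from datetime import date

# Hypothetical three-zone prefix layout (names are illustrative).
RAW_PREFIX = "raw"          # original CSV/JSON, untouched
CURATED_PREFIX = "curated"  # Parquet, compressed and partitioned
SERVING_PREFIX = "serving"  # aggregates for dashboards/APIs

def curated_key(dataset: str, event_date: date, part: int = 0) -> str:
    """Build a Hive-style partitioned key for the curated zone.

    Partitioning by date (a commonly filtered column) lets query
    engines prune partitions instead of scanning every object.
    """
    return (
        f"{CURATED_PREFIX}/{dataset}/"
        f"date={event_date.isoformat()}/"
        f"part-{part:05d}.snappy.parquet"
    )

print(curated_key("orders", date(2024, 3, 15)))
# curated/orders/date=2024-03-15/part-00000.snappy.parquet
```

Engines like Athena and Redshift Spectrum recognize the `date=YYYY-MM-DD` path segment as a partition column, so a `WHERE date = ...` filter only reads the matching prefix.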
Conversion services: Glue ETL jobs are the primary tool for format conversion. A typical Glue job reads CSV/JSON from the raw zone, applies transformations (data type casting, null handling, deduplication), writes Parquet to the curated zone with Snappy compression, and partitions by date. Firehose can also convert to Parquet on delivery using its built-in format conversion feature, which is simpler for streaming ingestion.
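In practice these transformations run as Glue PySpark jobs against DataFrames; the standard-library sketch below only illustrates the three cleaning steps themselves (type casting, null handling, deduplication) on parsed CSV rows. The column names and null-handling policy are hypothetical.

```python
import csv
import io

# Illustrative raw CSV as it might land in the raw zone:
# note the duplicate order and the missing amount.
RAW_CSV = """order_id,amount,region
1001,19.99,us-east
1001,19.99,us-east
1002,,eu-west
1003,42.50,us-east
"""

def clean_rows(raw_text: str) -> list[dict]:
    """Apply the curated-zone transformations: cast types,
    handle nulls, and deduplicate on the primary key."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        key = row["order_id"]
        if key in seen:  # deduplication: keep first occurrence
            continue
        seen.add(key)
        cleaned.append({
            "order_id": int(key),  # type casting: string -> int
            # null handling: empty string becomes 0.0 (policy is a choice)
            "amount": float(row["amount"]) if row["amount"] else 0.0,
            "region": row["region"],
        })
    return cleaned

for row in clean_rows(RAW_CSV):
    print(row)
```

A real Glue job would end by writing the cleaned frame as Snappy-compressed Parquet partitioned by date; the cleaning logic is the part that carries over.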
⚠️ Exam Trap: Firehose's built-in Parquet conversion requires a Glue Data Catalog table to define the target schema. If the question mentions Firehose delivering Parquet to S3, a Glue table must exist. This is a common trick question: candidates forget the Glue dependency.
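The Glue dependency is visible directly in the delivery stream configuration: the `DataFormatConversionConfiguration` block must name a Glue database and table whose schema drives the Parquet output. A sketch of that fragment, with placeholder database, table, and role names:

```json
{
  "DataFormatConversionConfiguration": {
    "Enabled": true,
    "SchemaConfiguration": {
      "DatabaseName": "analytics_db",
      "TableName": "clickstream_events",
      "RoleARN": "arn:aws:iam::123456789012:role/firehose-glue-role"
    },
    "InputFormatConfiguration": {
      "Deserializer": { "OpenXJsonSerDe": {} }
    },
    "OutputFormatConfiguration": {
      "Serializer": { "ParquetSerDe": {} }
    }
  }
}
```

Note the role: Firehose also needs IAM permission to read the Glue table, another detail that surfaces in exam scenarios.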
Reflection Question: Why should the raw zone keep data in its original format even though columnar formats are more efficient? In what scenario would you need to reprocess from the raw zone?