1.3.2. Schema-on-Read vs Schema-on-Write: A Foundational Decision
š” First Principle: Where you enforce data structure ā at write time or at read time ā determines your entire architecture. Schema-on-write means data must conform before storage (like loading into Redshift). Schema-on-read means data is stored raw and structure is applied when queried (like Parquet files in S3 queried by Athena). Each approach serves different needs, and the exam tests your ability to choose.
In a schema-on-write system, data is validated, cleaned, and structured before it lands in the target store. Think of loading data into a Redshift table ā the data must match the table's column definitions, data types, and constraints, or the load fails. This approach guarantees consistency: every query gets clean, well-structured data. The downside? It's rigid. When business requirements change and you need a new column or a different aggregation, you have to modify the pipeline, reload the data, and wait.
In a schema-on-read system, data is stored in its raw or semi-structured form ā JSON files, CSV dumps, or Parquet files in S3 ā and the schema is applied when you query it. Athena does this naturally: you define an external table that describes the data's structure, and Athena interprets the files at query time. This approach gives maximum flexibility ā store everything, figure out how to use it later. The downside? Queries may encounter unexpected data, nulls, or schema drift, and you need governance to prevent the data lake from becoming a data swamp.
| Aspect | Schema-on-Write | Schema-on-Read |
|---|---|---|
| When structured | Before storage | At query time |
| Data quality | Guaranteed at write | Must validate separately |
| Flexibility | Rigid ā schema changes require pipeline changes | Flexible ā raw data always available |
| Query performance | Predictable, often faster | Depends on format and partitioning |
| AWS Services | Redshift, RDS/Aurora, DynamoDB | S3 + Athena, S3 + Glue Catalog, Lake Formation |
| Best for | Structured analytics, dashboards, BI | Data lakes, exploratory analysis, ML |
Most modern AWS architectures use both. Raw data lands in S3 (schema-on-read), gets transformed and loaded into Redshift (schema-on-write) for dashboards, while data scientists also query the raw S3 data via Athena for exploratory work. This "lakehouse" pattern appears frequently on the exam.
ā ļø Exam Trap: If a question describes an organization that needs to store raw data for future unknown use cases AND provide structured reporting today, the answer typically involves S3 (raw storage) plus a transformation pipeline into Redshift or a governed layer in Lake Formation ā not one or the other.
Reflection Question: A healthcare company must store all patient data for 7 years for compliance but only knows what reports they need for the current quarter. Which approach ā schema-on-read, schema-on-write, or both ā best serves this requirement?