3.5. Data Modeling and Schema Evolution
💡 First Principle: Imagine adding a new column to a table with 10 billion rows in a traditional warehouse: it could take hours of downtime. A data model is a blueprint that determines how efficiently data serves its consumers, like an architect designing a building for its intended purpose. A warehouse optimized for bulk scanning (a star schema in Redshift) is structured completely differently from a database optimized for single-record lookups (DynamoDB with a well-designed partition key). Get the model wrong and you either rebuild later or live with poor performance indefinitely.
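The bulk-scan-versus-point-lookup distinction comes down to how rows are placed. The sketch below illustrates the warehouse side of that idea: when two Redshift tables share a distribution key, matching rows hash to the same slice, so a join can run slice-locally with no network shuffle. This is a simplified model, not Redshift's actual internals; `slice_for`, the four-slice count, and the table data are all illustrative assumptions.

```python
from collections import defaultdict

def slice_for(dist_key: int, num_slices: int = 4) -> int:
    """Stand-in for Redshift's hash distribution of a DISTKEY value to a slice."""
    return dist_key % num_slices

# Two tables, both distributed on customer_id (illustrative data).
customers = [{"customer_id": c, "name": f"c{c}"} for c in range(8)]
orders = [{"order_id": o, "customer_id": o % 8} for o in range(16)]

# Place each row on its slice, as DISTKEY(customer_id) would on both tables.
cust_slices = defaultdict(list)
order_slices = defaultdict(list)
for c in customers:
    cust_slices[slice_for(c["customer_id"])].append(c)
for o in orders:
    order_slices[slice_for(o["customer_id"])].append(o)

# Because both tables used the same distribution key, the join runs
# independently per slice: no rows have to move between slices.
joined = []
for s in range(4):
    names = {c["customer_id"]: c["name"] for c in cust_slices[s]}
    for o in order_slices[s]:
        joined.append((o["order_id"], names[o["customer_id"]]))

print(len(joined))  # 16: every order found its customer on its own slice
```

If the two tables were distributed on different keys, the per-slice lookup would miss, and the engine would have to redistribute one table over the network before joining, which is exactly the cost that aligning distribution keys with join patterns avoids.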
Without intentional data modeling, teams dump raw data into tables and hope for the best. The consequences: Redshift queries run for minutes instead of seconds because distribution keys don't align with join patterns; DynamoDB requests get throttled and costs climb because a hot partition key concentrates traffic on a single partition; schema changes break downstream consumers because nobody tracked dependencies.
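The hot-partition problem is easy to see in a small simulation. This is a sketch, not DynamoDB's real hashing: `partition_for` and the eight-partition count are illustrative assumptions, but the shape of the problem is the same, since a low-cardinality key sends every request to one partition, while a higher-cardinality composite key spreads the load.

```python
import hashlib
from collections import Counter

def partition_for(key: str, num_partitions: int = 8) -> int:
    """Illustrative stand-in for hashing a partition key to a storage partition."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Hot key: 1,000 requests for one busy tenant all hash to the same partition.
hot_traffic = Counter(partition_for("tenant-42") for _ in range(1000))

# Composite key (tenant id + order id) spreads the same 1,000 requests out.
spread_traffic = Counter(
    partition_for(f"tenant-42#order-{i}") for i in range(1000)
)

print(max(hot_traffic.values()))     # 1000 -- the entire load on one partition
print(max(spread_traffic.values()))  # far smaller: load is spread across partitions
```

The fix in practice follows the same pattern: append a high-cardinality suffix (order id, timestamp shard) to the natural key so traffic distributes evenly, at the cost of needing a query-side strategy to read the shards back together.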
The exam tests modeling decisions for Redshift, DynamoDB, and Lake Formation, as well as schema evolution patterns (how to change schemas without breaking things) and data lineage (tracking where data came from and how it was transformed). What trade-offs do you accept when choosing a star schema over a normalized schema? When does denormalization help, and when does it create maintenance nightmares?
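The star-versus-normalized trade-off can be made concrete with a toy example. In this sketch (table contents and function names are illustrative), the normalized version does a dimension lookup per fact row at read time, while the denormalized version copies the dimension attribute onto each fact row, so scans need no join but an attribute change means rewriting every copied row.

```python
# Normalized: fact rows reference a dimension table by key; reads join per row.
dim_product = {1: {"name": "widget", "category": "tools"}}
fact_sales = [{"product_id": 1, "qty": 3}, {"product_id": 1, "qty": 2}]

def sales_by_category_normalized() -> dict:
    totals: dict = {}
    for row in fact_sales:
        # Join at read time: look up the dimension for every fact row.
        category = dim_product[row["product_id"]]["category"]
        totals[category] = totals.get(category, 0) + row["qty"]
    return totals

# Denormalized: category is copied onto each fact row. Scans skip the join,
# but recategorizing a product now means rewriting every copied row.
fact_sales_denorm = [
    {"product_id": 1, "category": "tools", "qty": 3},
    {"product_id": 1, "category": "tools", "qty": 2},
]

def sales_by_category_denorm() -> dict:
    totals: dict = {}
    for row in fact_sales_denorm:
        totals[row["category"]] = totals.get(row["category"], 0) + row["qty"]
    return totals

print(sales_by_category_normalized())  # {'tools': 5}
print(sales_by_category_denorm())      # {'tools': 5}
```

Both queries return the same answer; what differs is where the cost lands. Denormalization pays at write and update time to make scans cheap, which is why it suits append-heavy analytical workloads and becomes a maintenance burden when dimension attributes change often.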