Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.5.1. Schema Design for Redshift, DynamoDB, and Lake Formation

šŸ’” First Principle: Each data store has different internal mechanics, so optimal schema design differs fundamentally. Redshift distributes data across nodes (distribution key matters), DynamoDB partitions by key (partition key design is everything), and Lake Formation manages access to S3-based tables (partition strategy drives query performance). Design for the engine, not for the textbook.

Redshift schema design:

Distribution styles. KEY distributes rows by a column's hash — co-locates rows with the same key on the same node, eliminating data movement for joins on that column. EVEN distributes rows round-robin across all nodes — good for tables that don't join frequently. ALL copies the entire table to every node — good for small dimension tables that join with large fact tables. AUTO lets Redshift choose based on table size.

Sort keys. Determine the physical order of rows on disk. Compound sort keys optimize queries that filter on a prefix of the key columns (like a phone book sorted by last name, then first name). Interleaved sort keys give equal weight to multiple columns — better for ad-hoc queries that filter on different columns.

DynamoDB schema design:

The partition key must distribute traffic evenly. High-cardinality attributes (user ID, order ID) work well. Low-cardinality attributes (status, country) create hot partitions. Composite sort keys enable multiple access patterns within a partition (e.g., SK=METADATA#, SK=ORDER#2025-03-15).

Lake Formation schema considerations. Lake Formation governs access to data lake tables. Column-level and row-level security policies affect how schemas should be designed — sensitive columns (PII) should be clearly identifiable so Lake Formation filters can be applied.

āš ļø Exam Trap: In Redshift, choosing the wrong distribution key for a large table forces expensive data redistribution during joins. If two large tables are frequently joined, their distribution key should be the join column. The exam loves this: "queries joining tables A and B on customer_id are slow" — the fix is setting customer_id as the distribution key for both tables.

Reflection Question: A Redshift data warehouse has a 10 billion row fact_sales table joined with a 500 row dim_store table. What distribution style should each table use, and why?

Alvin Varughese
Written byAlvin Varughese
Founder•15 professional certifications