4.2.1. SQL Analysis with Athena and Redshift
💡 First Principle: Athena and Redshift both execute SQL, but their architectures serve different query patterns. Athena is serverless and scans S3 directly, making it ideal for infrequent, ad-hoc queries on data lake data with no upfront cost. Redshift maintains its own storage and query engine, making it ideal for frequent, complex, concurrent queries where sub-second response time and caching matter.
Athena key features: partition projection (calculate partitions mathematically instead of querying the Glue Catalog), workgroups (isolate queries by team with separate cost tracking and limits), CTAS (create tables from query results), and federated query (query data in RDS, DynamoDB, and other sources via Lambda connectors). Views in Athena create reusable query abstractions without materializing data.
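Partition projection is configured entirely through table properties, so no crawler or `MSCK REPAIR TABLE` run is ever needed. A minimal sketch for a date-partitioned table; the bucket, table, and column names are illustrative, while the `projection.*` and `storage.location.template` property keys are the ones Athena documents:

```sql
-- Hypothetical clickstream table partitioned by date (dt).
-- Athena computes valid dt values from the range/format below
-- instead of looking partitions up in the Glue Catalog.
CREATE EXTERNAL TABLE clickstream (
  user_id string,
  url     string,
  ts      timestamp
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/clickstream/'
TBLPROPERTIES (
  'projection.enabled'          = 'true',
  'projection.dt.type'          = 'date',
  'projection.dt.range'         = '2023/01/01,NOW',
  'projection.dt.format'        = 'yyyy/MM/dd',
  'projection.dt.interval'      = '1',
  'projection.dt.interval.unit' = 'DAYS',
  -- Maps each projected dt value to its S3 prefix:
  'storage.location.template'   = 's3://example-bucket/clickstream/${dt}/'
);
```

New daily prefixes under `s3://example-bucket/clickstream/` become queryable immediately, since the partition list is derived mathematically at query time.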
Redshift SQL supports stored procedures for complex transformation logic, materialized views for precomputed aggregation, and workload management (WLM) queues for prioritizing different query types. The v1.1 syllabus explicitly tests SQL in both Redshift and Athena.
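A materialized view precomputes an aggregation once and serves repeated reads from the stored result. A minimal Redshift sketch, assuming a hypothetical `sales` table with `order_date` and `amount` columns:

```sql
-- Precompute daily revenue so morning reports hit a small,
-- already-aggregated result set instead of scanning raw sales rows.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date,
       SUM(amount) AS total_revenue
FROM   sales
GROUP  BY order_date;

-- Bring the view up to date after new data is loaded; Redshift refreshes
-- incrementally when the view's query shape allows it, otherwise it recomputes.
REFRESH MATERIALIZED VIEW daily_revenue;
```

Scheduling the `REFRESH` right after the nightly load means downstream dashboards query fresh, precomputed results.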
| Decision Factor | Choose Athena | Choose Redshift |
|---|---|---|
| Query frequency | Occasional, ad-hoc | Frequent, repeated |
| Data location | S3 (data lake) | Loaded into Redshift or via Spectrum |
| Concurrency | Low-moderate | High (hundreds of users) |
| Latency | Seconds to minutes | Sub-second (cached) |
| Cost model | Per TB scanned | Per cluster-hour or RPU-second |
| Setup | None (serverless) | Cluster or serverless configuration |
⚠️ Exam Trap: Athena's MSCK REPAIR TABLE and partition projection both handle partitions, but differently. MSCK updates the Glue Catalog by scanning S3 (slow for many partitions). Partition projection calculates partitions at query time without the catalog (fast, no crawl needed). For time-series data with many date partitions, partition projection is strongly preferred.
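Without projection, the catalog-based alternatives look like this; the table name and partition value are illustrative:

```sql
-- Scans the table's S3 location and registers any unregistered
-- partitions in the Glue Catalog; cost grows with partition count.
MSCK REPAIR TABLE clickstream;

-- Or register one known partition explicitly -- fast per call,
-- but it becomes operational toil for daily time-series data.
ALTER TABLE clickstream ADD IF NOT EXISTS
  PARTITION (dt = '2024/05/01');
```

Either way the Glue Catalog must be kept in sync with S3, which is exactly the bookkeeping partition projection eliminates.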
Reflection Question: An analytics team runs 20 standard reports every morning at 8 AM, each joining 3 tables with billions of rows. Throughout the day, they also run occasional ad-hoc queries on historical data in S3. What's the optimal architecture?