Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.1.2. Amazon Redshift: The Cloud Data Warehouse

šŸ’” First Principle: Redshift exists because data lakes sacrifice query performance for flexibility. When your business needs sub-second dashboard refreshes on billions of rows with complex joins and aggregations, a data warehouse with columnar storage, massively parallel processing, and query optimization delivers what S3 + Athena cannot — consistent, fast performance under concurrent load.

Redshift's architecture enables high-performance analytics through columnar storage (reads only the columns a query needs), MPP (Massively Parallel Processing) (distributes work across multiple nodes), result caching (identical queries return cached results instantly), and compiled query plans (queries optimize on repeated execution).

Redshift deployment models:

Redshift Provisioned (RA3). You choose node types and cluster size. RA3 nodes separate compute from storage (using S3 as managed storage), allowing independent scaling. Best for predictable, high-concurrency workloads where you need control over capacity.

Redshift Serverless. No cluster management — you specify RPU (Redshift Processing Units) capacity, and Redshift auto-scales. Pay per compute-second. Best for variable or unpredictable workloads, development environments, and teams that want zero cluster management.

Redshift Spectrum. Queries data directly in S3 using external tables defined in the Glue Data Catalog — without loading it into Redshift. This extends your warehouse to the data lake. Use Spectrum when data is too large to load entirely into Redshift or when it's queried infrequently.

Federated queries. Query data in Amazon RDS, Aurora, or other Redshift clusters directly from Redshift without copying data. Useful for joining warehouse data with transactional data in real-time.

Materialized views. Precomputed query results stored in Redshift, refreshed on demand or incrementally. Dramatically accelerates dashboards that run the same expensive aggregation repeatedly.

FeatureRedshift ProvisionedRedshift Serverless
ManagementCluster sizing, scaling policiesAuto-managed
PricingPer node-hour (reserved discounts)Per RPU-second
ScalingManual or scheduledAutomatic
Best forPredictable, steady workloadsVariable workloads, dev/test

āš ļø Exam Trap: "Use Redshift Spectrum" and "load data into Redshift" are different answers. Spectrum queries S3 data in place (no loading), while COPY loads data into Redshift managed storage. If a question says "query data without loading it" or "extend the data warehouse to the data lake," the answer is Spectrum. If it says "optimize query performance for frequently accessed data," the answer is loading with COPY.

Reflection Question: A BI team runs 50 dashboard queries per hour against 2 TB of sales data, but also needs ad-hoc access to 100 TB of historical log data. How would you architect this with Redshift?

Alvin Varughese
Written byAlvin Varughese
Founder•15 professional certifications