4.3.4. Data Warehouse Query Optimization
💡 First Principle: Warehouse query performance depends on statistics accuracy, caching, and query structure. The query optimizer makes execution plan decisions based on statistics—outdated statistics lead to suboptimal plans, like navigating with a map from last year. Without proper optimization, queries that should take seconds take minutes as data grows.
Scenario: A reporting query joins three large tables and suddenly takes 10x longer after a bulk data load. The statistics haven't been updated since the load, so the optimizer underestimates row counts and chooses a nested loop join instead of a hash join.
Statistics Management
Statistics help the query optimizer estimate result sizes and choose efficient execution plans:
| Action | Command | When to Use |
|---|---|---|
| Update all statistics | UPDATE STATISTICS dbo.TableName | After bulk loads |
| Create specific stats | CREATE STATISTICS stat_name ON dbo.Table(Column) | Frequently filtered columns |
| Auto-create statistics | Enabled by default | Good for most scenarios |
| View statistics health | DBCC SHOW_STATISTICS('dbo.Table', 'stat_name') | Diagnosing slow queries |
When statistics go stale:
- After loading >20% new data into a table
- After significant DELETE operations
- After schema changes (new columns)
Result Set Caching
Fabric warehouses automatically cache query results:
- How it works: Identical queries return cached results without re-executing
- Cache duration: Results cached until underlying data changes
- Benefit: Repeated dashboard queries return instantly
- Limitation: Only benefits identical queries—parameter changes miss the cache
Query Performance Best Practices
| Practice | Why | Example |
|---|---|---|
| Filter early (WHERE clause) | Reduce data scanned | WHERE date > '2024-01-01' before joining |
| **Avoid SELECT *** | Reads unnecessary columns | Specify only needed columns |
| Use appropriate JOINs | Wrong join = wrong results or poor performance | INNER vs LEFT based on requirements |
| Minimize DISTINCT | Forces expensive sorting | Redesign query or fix upstream duplicates |
| Use CTEs for readability | Optimizer flattens them anyway | No performance cost, easier to debug |
Materialized Views (Preview)
- Purpose: Pre-computed results that auto-refresh when source data changes
- Benefit: Complex aggregations return instantly
- Trade-off: Storage cost for pre-computed results
- Best for: Dashboards querying the same aggregations repeatedly
⚠️ Exam Trap: Result set caching is automatic—you don't need to enable it. Questions about "improving repeated query performance" are not asking you to configure caching. Instead, they may be testing statistics updates or query structure improvements.
⚠️ Common Pitfall: Running UPDATE STATISTICS on every table daily regardless of data changes. Only update statistics on tables that received significant data modifications.