4.3.1. Pipeline Optimization
💡 First Principle: Pipeline optimization focuses on reducing execution time and resource cost. The key levers are parallelism, throughput settings, and minimizing unnecessary data movement—like optimizing a delivery route by sending multiple trucks simultaneously and only shipping what's needed.
Scenario: A nightly pipeline loads data from 12 source tables into a lakehouse. Running each Copy activity sequentially takes 3 hours. The business SLA is 1 hour.
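A quick back-of-the-envelope calculation shows why parallelism is the first lever to pull here. Assuming the 12 copies are independent and each takes roughly 15 minutes (3 hours ÷ 12), this hypothetical helper estimates wall-clock time when copies run in waves of a given concurrency:

```python
import math

def pipeline_runtime_minutes(num_tables: int, minutes_per_copy: float, concurrency: int) -> float:
    """Estimate wall-clock time when copies run in waves of `concurrency` at a time."""
    waves = math.ceil(num_tables / concurrency)
    return waves * minutes_per_copy

# Sequential: 12 waves of 1 copy = 180 minutes (misses the 1-hour SLA).
# Concurrency of 4: 3 waves of 4 copies = 45 minutes (meets the SLA).
sequential = pipeline_runtime_minutes(12, 15, concurrency=1)
parallel = pipeline_runtime_minutes(12, 15, concurrency=4)
```

The per-copy duration is an assumption for illustration; in practice, measure actual Copy activity durations first, since uneven table sizes make the longest wave the limiting factor.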
Copy Activity Throughput Settings
| Setting | Purpose | Impact |
|---|---|---|
| Data Integration Units (DIUs) | Compute power allocated to Copy activity | Higher DIUs = faster copy, higher cost |
| Degree of Copy Parallelism | Number of concurrent threads within a Copy | Speeds up large file/table copies |
| Auto DIU | Let Fabric choose optimal DIUs | Good default for most scenarios |
DIU Selection Guidance:
| Data Volume | Recommended DIU | Notes |
|---|---|---|
| < 1 GB | 4 (minimum) | Auto is fine |
| 1–10 GB | 8–16 | Manual may help |
| 10–100 GB | 16–64 | Definitely tune |
| > 100 GB | 64–256 | Monitor throughput |
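The guidance table above can be encoded as a small lookup, useful when parameterizing DIU settings across many Copy activities. This is an illustrative sketch of the table, not an official sizing formula:

```python
def recommended_diu_range(data_gb: float) -> tuple[int, int]:
    """Return a (low, high) DIU range per the sizing table above (illustrative)."""
    if data_gb < 1:
        return (4, 4)      # Auto is fine at this scale
    if data_gb <= 10:
        return (8, 16)     # manual tuning may help
    if data_gb <= 100:
        return (16, 64)    # definitely tune
    return (64, 256)       # monitor throughput to confirm DIUs are used
```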
Parallel Execution Patterns
Three ways to run activities concurrently:
- Independent Copy activities: Activities with no dependencies run in parallel by default
- ForEach with sequential = false: Process multiple items concurrently
- Batch count: Control maximum concurrency in ForEach (default 20, max 50)
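The ForEach batching behavior can be sketched with a thread pool: the batch count caps how many iterations run at once, and remaining items queue until a slot frees up. This is an analogy in Python, not how the pipeline engine is implemented:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def copy_table(name: str) -> str:
    """Stand-in for one Copy activity inside the ForEach."""
    time.sleep(0.05)  # simulated copy work
    return f"{name}: done"

tables = [f"table_{i:02d}" for i in range(12)]
batch_count = 4  # analogous to the ForEach batch count setting

# With sequential = false, up to `batch_count` copies run at once;
# the 12 items complete in 3 waves instead of 12.
with ThreadPoolExecutor(max_workers=batch_count) as pool:
    results = list(pool.map(copy_table, tables))
```

With sequential = true, the same loop would behave like `max_workers=1`, which is the trap described below.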
Staging Patterns
- Direct copy (no staging): Source → Destination directly. Fastest when source/destination are both cloud-based
- Staging via Blob Storage: Source → Staging → Destination. Required for some on-premises to cloud scenarios and enables PolyBase loading
- When to stage: On-premises sources, format conversion needed, or when PolyBase dramatically improves load speed
Pipeline Optimization Checklist
| Optimization | When to Apply | Expected Impact |
|---|---|---|
| Parallelize independent activities | Multiple source tables | 2-10x faster |
| Increase DIUs | Large data volumes | 2-4x faster per activity |
| Use ForEach with batching | Repeating patterns across items | Concurrent processing; simpler pipeline |
| Filter at source (query) | Large tables with partial needs | Less data moved |
| Use COPY INTO instead of Copy activity | Bulk warehouse loads | Faster, less overhead |
| Enable staging for on-prem | On-premises to cloud loads | Required for some connectors |
⚠️ Exam Trap: Setting ForEach sequential = true forces sequential execution even when items are independent. Questions about "pipeline runs slower than expected" with ForEach often test whether you know to set sequential = false and increase batchCount.
⚠️ Common Pitfall: Increasing DIUs beyond what the source can serve. If the source database throttles at 100 MB/s, setting 256 DIUs wastes money—the bottleneck is the source, not the copy engine.
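The pitfall above is just a min() of two rates: effective throughput is capped by whichever side is slower. A minimal sketch, assuming a hypothetical per-DIU throughput figure (the real MB/s-per-DIU varies by workload and connector):

```python
def effective_throughput_mbps(source_cap_mbps: float, diu: int,
                              mbps_per_diu: float = 4.0) -> float:
    """Effective copy rate is the minimum of what the source can serve
    and what the copy engine can pull. mbps_per_diu is an illustrative assumption."""
    return min(source_cap_mbps, diu * mbps_per_diu)

# Source throttles at 100 MB/s: ~32 DIUs already saturate it,
# so paying for 256 DIUs yields the same 100 MB/s.
at_32_dius = effective_throughput_mbps(100, 32)
at_256_dius = effective_throughput_mbps(100, 256)
```

Monitor the Copy activity's reported throughput: if it plateaus while you raise DIUs, the bottleneck is elsewhere (source, network, or destination), and more DIUs only add cost.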