2.2.1. Spark Workspace Settings
💡 First Principle: Apache Spark distributes work across multiple "executors" (worker processes). Think of it like a restaurant kitchen: having 20 chefs when you only have 5 orders wastes money, but having 2 chefs during dinner rush means customers wait forever. Dynamic allocation adjusts your chef count based on actual orders.
Scenario: Your data engineering team runs Spark jobs with varying complexity—some process 1 GB of data, others process 1 TB. Fixed resource allocation means either wasting money on small jobs or failing on large ones.
Starter Pool vs. Custom Pools
- Starter Pool:
  - Pre-warmed Spark cluster for fast session start (no cold-start delay)
  - Limited customization
  - Best for: development, small workloads, quick testing
- Custom Spark Pool:
  - User-defined node sizes and counts
  - Supports autoscaling
  - Best for: production workloads, specific resource requirements
Key Spark Configuration Options
| Setting | Purpose | When to Enable |
|---|---|---|
| Dynamic Allocation | Adjusts executor count based on workload | Always for variable workloads |
| Autoscale | Adjusts node count based on demand | Production environments with varying loads |
| Memory-Optimized Nodes | Higher memory ratio for data-intensive operations | Large dataset processing, complex transformations |
| Compute-Optimized Nodes | Better for CPU-bound operations | ML training, complex calculations |
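The Dynamic Allocation setting in the table maps to standard Apache Spark configuration properties. A minimal sketch of the relevant properties, contrasted with a static alternative (the specific min/max values here are illustrative assumptions; in Fabric, node size and family are chosen in the pool UI rather than in code):

```python
# Standard Apache Spark properties behind dynamic allocation (a sketch;
# values shown are illustrative, not Fabric defaults).
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # Floor: keep one executor so small jobs still run immediately.
    "spark.dynamicAllocation.minExecutors": "1",
    # Ceiling: cap large jobs so they cannot exhaust the pool.
    "spark.dynamicAllocation.maxExecutors": "20",
    # Release executors idle for 60s back to the pool.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}

# Static alternative: a fixed executor count regardless of workload.
static_conf = {
    "spark.executor.instances": "10",
}
```

With the dynamic configuration, a 1 GB job shrinks toward `minExecutors` while a 1 TB job can grow to `maxExecutors`; the static variant pays for 10 executors in both cases.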
Visual: Dynamic vs. Static Allocation
⚠️ Exam Trap: More partitions don't always improve performance. Excessive partitions create overhead from task scheduling and shuffle operations. Partition count should align with executor count and data volume—hundreds of tiny partitions can be slower than dozens of right-sized ones.
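The trade-off above can be sketched as a back-of-the-envelope calculation. This helper uses the common ~128 MB-per-partition rule of thumb (an assumption, not a Fabric setting) and ensures there are at least as many partitions as available cores:

```python
def suggest_partitions(data_bytes: int, total_cores: int,
                       target_partition_bytes: int = 128 * 1024 * 1024) -> int:
    """Suggest a partition count: sized so each partition is near the
    target (~128 MB here), but never fewer than the available cores."""
    by_size = max(1, data_bytes // target_partition_bytes)
    return max(by_size, total_cores)

# 1 GB job on 8 cores: core count dominates -> 8 partitions.
print(suggest_partitions(1 * 1024**3, 8))
# 1 TB job on 8 cores: data size dominates -> 8192 partitions of ~128 MB.
print(suggest_partitions(1024**4, 8))
```

For the 1 GB case, splitting into hundreds of partitions would mean each task processes only a few megabytes, and scheduling overhead would dominate actual work.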
Session Tags for Session Reuse
- Concept: Reuse existing Spark sessions across pipeline activities to avoid cold-start latency (typically 30-60 seconds per new session)
- Implementation: Assign the same session tag to Spark activities in a Data Factory pipeline; activities sharing a tag reuse the live session instead of starting a new one
- Use Case: Multiple notebook activities in a single pipeline run
Reflection Question: Your Spark jobs consistently use only 20% of allocated resources. What combination of settings would you adjust to optimize cost without risking job failures?