2.2.2. Data Workflow Workspace Settings
💡 First Principle: Think of Data Workflow settings like the traffic lights controlling how many cars can enter a highway at once. Too many concurrent dataflows fight for the same compute resources, causing everyone to slow down. Too few, and you waste time waiting in a queue while resources sit idle. The art is finding the right throughput for your capacity.
Scenario: Your data engineering team runs 20 Dataflow Gen2 refreshes concurrently during the nightly batch window. Without proper workspace settings, some dataflows queue for extended periods while others consume disproportionate resources, and everything takes longer than it should.
Understanding Data Workflow Settings
Data Workflow settings control how orchestration and transformation workloads consume capacity within a workspace. These settings are distinct from Spark settings and apply specifically to:
- Dataflow Gen2 compute allocation: How many concurrent dataflow refreshes can run
- Data Workflow (Airflow) resource allocation: Resources for DAG execution
- Concurrency limits: Maximum parallel executions per workspace
Key Data Workflow Configuration Options
| Setting | Purpose | When to Adjust |
|---|---|---|
| Concurrent Dataflow Refreshes | Limit parallel dataflow executions | High contention during batch windows |
| Compute Timeout | Maximum runtime before automatic termination | Long-running transformations |
| Resource Allocation | Memory and CPU for workflow execution | Complex DAGs with many tasks |
Configuring Data Workflow Settings
1. Navigate to Workspace Settings → Data Engineering/Science
2. Locate the Data Workflow section
3. Configure concurrency and resource limits based on your capacity SKU
4. Balance parallelism against resource availability
Visual: Data Workflow Resource Management
⚠️ Exam Trap: Setting concurrency too high for your capacity SKU doesn't make things faster—it makes everything slower. Higher concurrency distributes resources more thinly, potentially causing all dataflows to crawl. Match concurrency to capacity and workload patterns.
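A back-of-the-envelope calculation makes the trap concrete. The numbers below are purely illustrative (not actual SKU figures): a fixed pool of capacity units is divided among however many dataflows run concurrently, so raising concurrency shrinks each dataflow's share.

```python
# Illustrative only: a fixed capacity pool split across concurrent dataflows.
capacity_cus = 64          # hypothetical SKU capacity, in capacity units (CUs)
for concurrency in (4, 8, 32):
    per_dataflow = capacity_cus / concurrency
    print(f"{concurrency} concurrent refreshes -> {per_dataflow:.1f} CUs each")
```

At a concurrency of 32, each refresh gets one-eighth the resources it would get at 4; total throughput does not improve, but every individual refresh slows down.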
Key Trade-Offs:
- High Concurrency vs. Individual Performance: More parallel dataflows mean each gets fewer resources
- Long Timeouts vs. Resource Blocking: Long timeouts protect large jobs but can block capacity if jobs hang
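The timeout trade-off above can be sketched the same way (generic Python with `asyncio.wait_for`, not a Fabric API, and `TIMEOUT_S` is an arbitrary illustrative value): a runtime cap terminates a hung job so it stops holding capacity that queued dataflows could use.

```python
import asyncio

TIMEOUT_S = 0.05   # hypothetical compute timeout for illustration

async def hung_dataflow() -> None:
    # Simulates a refresh that hangs and never makes progress.
    await asyncio.sleep(10)

async def run_with_timeout() -> str:
    try:
        # Terminate the refresh once it exceeds the configured timeout,
        # freeing its capacity for queued dataflows.
        await asyncio.wait_for(hung_dataflow(), timeout=TIMEOUT_S)
        return "completed"
    except asyncio.TimeoutError:
        return "terminated"

result = asyncio.run(run_with_timeout())
print(result)
```

A longer timeout would let a genuinely large transformation finish, but this hung job would then block capacity for the full timeout window, which is exactly the trade-off to weigh.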