3.2.4. Distributed Training and Reducing Training Time
💡 First Principle: When a single machine can't train your model in an acceptable time, you distribute the work across multiple machines. There are two fundamental strategies: data parallelism (same model, split data) and model parallelism (split model, same data). Choosing the wrong one wastes resources: model parallelism on a model that fits on one GPU adds communication overhead with no benefit.
| Strategy | How It Works | When to Use | SageMaker Support |
|---|---|---|---|
| Data parallelism | Each GPU gets a copy of the model and a portion of the data; gradients are synchronized | Model fits on one GPU; dataset is large | SageMaker Distributed Data Parallel library |
| Model parallelism | The model is split across GPUs; each GPU holds a portion | Model is too large for one GPU (e.g., LLMs) | SageMaker Distributed Model Parallel library |
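The key property of data parallelism is that averaging the per-shard gradients reproduces the full-batch gradient, so the model converges as if it had seen the whole batch. A minimal toy sketch (plain Python, not the SageMaker library itself; the linear model and data are illustrative):

```python
# Toy data parallelism: each "worker" computes a gradient on its shard of
# the batch; averaging the shard gradients matches the full-batch gradient.

def gradient(w, xs, ys):
    """Gradient of mean squared error for a 1-D linear model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def data_parallel_gradient(w, xs, ys, num_workers):
    """Split the batch into equal shards, compute per-worker gradients, average."""
    shard = len(xs) // num_workers
    grads = []
    for i in range(num_workers):
        lo, hi = i * shard, (i + 1) * shard
        grads.append(gradient(w, xs[lo:hi], ys[lo:hi]))  # one GPU's work
    return sum(grads) / num_workers  # the "all-reduce" (gradient sync) step

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
full = gradient(w, xs, ys)
dp = data_parallel_gradient(w, xs, ys, num_workers=2)
print(full, dp)  # both print -22.5: the gradients agree
```

In a real cluster the averaging step is the communication cost the 💡 note warns about: every synchronized step requires an all-reduce across GPUs.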
Other techniques to reduce training time:
Early stopping: Halt training when the validation metric plateaus. No wasted epochs training a model that's no longer improving.
Pipe mode / Fast File mode: Stream data from S3 instead of downloading it all first. Eliminates the "data download phase" at the start of training.
Spot Instances: Use SageMaker managed spot training to save 60-90% on compute costs. Training automatically checkpoints and resumes if a Spot Instance is interrupted. The trade-off is that interruptions add to wall-clock time, even though costs decrease.
Mixed precision training: Use FP16 (half-precision) instead of FP32 for most computations, keeping FP32 for numerically sensitive operations such as the master copy of the weights. Can roughly double training throughput on GPUs with Tensor Cores (V100, A100) with minimal accuracy loss.
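Early stopping reduces to a simple patience rule: track the best validation score seen so far and halt after it fails to improve for a fixed number of epochs. A minimal sketch (a hypothetical standalone example, not SageMaker's built-in mechanism; the scores are made up):

```python
# Patience-based early stopping: stop once the validation metric has not
# improved for `patience` consecutive epochs.

def train_with_early_stopping(val_scores, patience=2):
    """Return the epoch at which training halts, given per-epoch
    validation scores (higher is better)."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch  # halt: no improvement for `patience` epochs
    return len(val_scores) - 1  # ran all epochs without triggering

# Validation accuracy peaks at epoch 2, then plateaus; with patience=2,
# training halts at epoch 4 instead of running all 7 epochs.
scores = [0.70, 0.80, 0.85, 0.85, 0.84, 0.86, 0.85]
print(train_with_early_stopping(scores, patience=2))  # → 4
```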
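Several of these options are set directly on a SageMaker estimator. A hedged configuration sketch, assuming the PyTorch estimator from the SageMaker Python SDK; the role ARN, S3 paths, entry-point script, and framework versions are placeholders for your own account:

```python
# Config sketch only (not runnable without AWS credentials): combines
# managed spot training with checkpointing, FastFile input mode, and the
# SageMaker distributed data parallel library.
from sagemaker.pytorch import PyTorch
from sagemaker.inputs import TrainingInput

estimator = PyTorch(
    entry_point="train.py",              # placeholder training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=4,
    instance_type="ml.p3.16xlarge",      # SMDDP requires multi-GPU instances
    framework_version="1.13",
    py_version="py39",
    # Managed spot training: checkpoint so interrupted jobs can resume.
    use_spot_instances=True,
    max_run=3600 * 8,                    # max training time in seconds
    max_wait=3600 * 12,                  # must be >= max_run for spot
    checkpoint_s3_uri="s3://my-bucket/checkpoints/",  # placeholder
    # Data parallelism across the GPUs of all 4 instances.
    distribution={"smdistributed": {"dataparallel": {"enabled": True}}},
)

# Stream training data from S3 instead of downloading it all first.
train_input = TrainingInput(
    s3_data="s3://my-bucket/train/",     # placeholder
    input_mode="FastFile",
)
# estimator.fit({"train": train_input})  # launches the training job
```

Note how the spot trade-off from the table above shows up in the API: `max_wait` must exceed `max_run`, explicitly budgeting wall-clock time for interruptions.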
⚠️ Exam Trap: Spot Instances reduce cost but may increase wall-clock time due to interruptions. If a question asks about reducing cost, Spot Instances are correct. If it asks about reducing time with no tolerance for delays, Spot Instances are wrong—use On-Demand or Reserved instances instead. Read whether the question optimizes for cost or time.
Reflection Question: Training a model on a single ml.p3.2xlarge takes 24 hours. The team has a deadline in 8 hours. Which combination of distributed training, data streaming, and compute options would meet the deadline?