3.3.3. Debugging Convergence with SageMaker Debugger
💡 First Principle: When a training run produces a bad model, the question is why. Was it the data, the hyperparameters, or the model architecture? SageMaker Debugger captures training internals (tensors, gradients, activations) during training and applies rules that detect common failure patterns—letting you diagnose without re-running experiments.
| Built-in Rule | What It Detects | Symptom |
|---|---|---|
| VanishingGradient | Gradients approaching zero | Deep networks stop learning in early layers |
| ExplodingTensor | Tensor values (e.g., gradients) growing unboundedly | Loss becomes NaN or jumps wildly |
| Overfit | Training loss drops, validation loss rises | Good training metrics, poor generalization |
| LossNotDecreasing | Loss plateaus | Training isn't making progress |
| LowGPUUtilization | GPU underutilized during training | Paying for compute you're not using |
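In the SageMaker Python SDK, each rule in the table maps to a `rule_configs` factory function. A minimal sketch, assuming the `sagemaker` SDK is installed; note that LowGPUUtilization belongs to the profiler, so it uses `ProfilerRule` rather than `Rule`:

```python
# Minimal sketch: the table's built-in rules expressed with the
# SageMaker Python SDK's rule_configs factories.
from sagemaker.debugger import ProfilerRule, Rule, rule_configs

rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),    # VanishingGradient
    Rule.sagemaker(rule_configs.exploding_tensor()),      # ExplodingTensor
    Rule.sagemaker(rule_configs.overfit()),               # Overfit
    Rule.sagemaker(rule_configs.loss_not_decreasing()),   # LossNotDecreasing
    # GPU utilization is monitored by the profiler, hence ProfilerRule.
    ProfilerRule.sagemaker(rule_configs.LowGPUUtilization()),
]
```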
Debugger works through hooks that capture tensor values at configurable save intervals. In AWS Deep Learning Containers the hooks are registered automatically with zero code changes; in custom containers you register them yourself via the open-source `smdebug` library. Captured tensors are saved to S3 and evaluated against rules, each running as a separate rule evaluation job. When a rule triggers, it emits a CloudWatch event that can drive an alert or an automatic action such as stopping the training job (saving costs on a run that's clearly failing).
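A hedged sketch of the full wiring, assuming a PyTorch training job; the bucket, entry point, and IAM role below are placeholders. The built-in `StopTraining` action is what halts a failing job automatically:

```python
# Sketch: attach a tensor-capture hook and a stop-on-trigger rule to a
# training job. Bucket, entry_point, and role ARN are placeholders.
from sagemaker.debugger import (CollectionConfig, DebuggerHookConfig,
                                Rule, rule_configs)
from sagemaker.pytorch import PyTorch

hook_config = DebuggerHookConfig(
    s3_output_path="s3://my-bucket/debugger-output",  # tensors land here
    collection_configs=[
        CollectionConfig(name="gradients", parameters={"save_interval": "100"}),
        CollectionConfig(name="losses", parameters={"save_interval": "50"}),
    ],
)

# Stop the job automatically if loss plateaus, instead of paying for a
# run that is clearly failing.
stop_on_plateau = Rule.sagemaker(
    rule_configs.loss_not_decreasing(),
    actions=rule_configs.ActionList(rule_configs.StopTraining()),
)

estimator = PyTorch(
    entry_point="train.py",                               # placeholder script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    framework_version="1.13",
    py_version="py39",
    debugger_hook_config=hook_config,
    rules=[stop_on_plateau],
)
# estimator.fit({"training": "s3://my-bucket/train"})
```

While the job runs (or after it finishes), `estimator.latest_training_job.rule_job_summary()` reports each rule's evaluation status, so you can see which rule fired and when.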
⚠️ Exam Trap: Debugger detects issues during training. Model Monitor detects issues after deployment. If a question describes a training job where loss becomes NaN, the answer involves Debugger. If it describes a deployed model whose accuracy has degraded, the answer involves Model Monitor. Don't confuse training-time debugging with production monitoring.
Reflection Question: A deep neural network's training loss drops for 3 epochs then suddenly jumps to NaN. Which Debugger rule would have caught this, what's the likely cause, and what's the fix?