4.2.1. Auto Scaling SageMaker Endpoints
Auto scaling for SageMaker endpoints uses target tracking as the default policy: you set a target such as InvocationsPerInstance = 70 (the predefined metric SageMakerVariantInvocationsPerInstance), and SageMaker adjusts capacity automatically. The critical nuance is cooldown periods: the scale-out cooldown (default 300s) prevents rapid instance additions during traffic spikes, while the scale-in cooldown (default 600s) prevents premature removal during brief traffic lulls. The exam frequently tests the interplay between scaling and VPC configuration: an endpoint in a private VPC needs an interface endpoint for CloudWatch (to publish the metrics that drive scaling) and a gateway endpoint for S3 (to load model artifacts). Missing either dependency causes silent failures that are hard to diagnose.
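As a minimal sketch of how the target-tracking policy above is expressed through the Application Auto Scaling API (the endpoint name, policy name, and target value here are placeholder assumptions):

```python
# Sketch: build the request for a target-tracking scaling policy on a
# SageMaker endpoint variant. In practice these kwargs would be passed to
# boto3.client("application-autoscaling").put_scaling_policy(**kwargs)
# after calling register_scalable_target() on the same ResourceId.
def target_tracking_policy(endpoint_name, variant_name="AllTraffic",
                           target_invocations=70.0,
                           scale_out_cooldown=300, scale_in_cooldown=600):
    return {
        "PolicyName": f"{endpoint_name}-invocations-target",
        "ServiceNamespace": "sagemaker",
        # ResourceId identifies the production variant being scaled
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_invocations,  # invocations per instance
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": scale_out_cooldown,  # default 300s
            "ScaleInCooldown": scale_in_cooldown,    # default 600s
        },
    }
```

Note that the cooldowns live inside the policy configuration, which is why adjusting them is a policy change rather than an endpoint change.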
💡 First Principle: Auto scaling adjusts the number of instances behind an endpoint based on demand, ensuring you have enough capacity during peaks without paying for idle instances during troughs. The exam tests which scaling metric and policy to use for ML-specific workloads.
| Scaling Policy | How It Works | Best For |
|---|---|---|
| Target tracking | Maintains a metric at a target value | Steady scaling (e.g., keep invocations/instance at 100) |
| Step scaling | Adds/removes capacity in steps based on alarm thresholds | Rapid response to traffic spikes |
| Scheduled scaling | Pre-configures capacity at specific times | Predictable traffic patterns (business hours) |
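For the scheduled-scaling row, capacity is set by a scheduled action rather than a metric. A sketch of such an action (the endpoint name, cron window, and capacities are illustrative assumptions):

```python
# Sketch: a scheduled action that raises capacity at the start of business
# hours. These kwargs would be passed to put_scheduled_action() on the
# application-autoscaling client; a mirror action would lower capacity
# again in the evening.
def business_hours_action(endpoint_name, min_cap=4, max_cap=10):
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": f"{endpoint_name}-business-hours",
        "ResourceId": f"endpoint/{endpoint_name}/variant/AllTraffic",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        # Application Auto Scaling uses 6-field cron expressions (UTC)
        "Schedule": "cron(0 8 ? * MON-FRI *)",
        "ScalableTargetAction": {"MinCapacity": min_cap,
                                 "MaxCapacity": max_cap},
    }
```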
Key metrics for auto scaling ML endpoints:
| Metric | What It Measures | Scale When |
|---|---|---|
| InvocationsPerInstance | Request load per instance | Exceeds target (e.g., >100/instance) |
| ModelLatency | Time to run inference | Exceeds SLA threshold |
| CPUUtilization | Compute usage | Consistently >70% |
| GPUUtilization | GPU usage | Consistently >80% |
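A metric like ModelLatency has no predefined Application Auto Scaling metric type, so scaling on it typically means a step scaling policy triggered by a CloudWatch alarm. A sketch, with thresholds and step sizes as assumptions (ModelLatency is reported in microseconds):

```python
# Sketch: a step scaling policy intended to be attached to a CloudWatch
# alarm on the endpoint's ModelLatency metric. Step bounds are offsets
# (in microseconds) above the alarm's threshold; values are illustrative.
def latency_step_policy(endpoint_name):
    return {
        "PolicyName": f"{endpoint_name}-latency-steps",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/AllTraffic",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            "Cooldown": 300,
            "MetricAggregationType": "Average",
            "StepAdjustments": [
                {"MetricIntervalLowerBound": 0.0,
                 "MetricIntervalUpperBound": 100000.0,
                 "ScalingAdjustment": 1},   # slightly over SLA: add 1 instance
                {"MetricIntervalLowerBound": 100000.0,
                 "ScalingAdjustment": 2},   # far over SLA: add 2 instances
            ],
        },
    }
```

The larger step for larger breaches is what makes step scaling the "rapid response" option from the policy table above.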
Spot Instances aren't supported behind real-time inference endpoints, because a Spot interruption would drop in-flight requests. However, Spot Instances work well for training jobs and batch transform, where interruptions can be handled via checkpointing.
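As a sketch of the fields that enable managed Spot training with checkpointing in a CreateTrainingJob request (the S3 path is a placeholder, and the timeouts are illustrative):

```python
# Sketch: the Spot-related fields of a SageMaker CreateTrainingJob request.
# With managed Spot training, SageMaker resumes from the latest checkpoint
# in S3 after an interruption.
spot_training_fields = {
    "EnableManagedSpotTraining": True,
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,   # compute time actually consumed
        "MaxWaitTimeInSeconds": 7200,  # runtime plus time spent waiting for Spot
    },
    "CheckpointConfig": {
        "S3Uri": "s3://example-bucket/checkpoints/",  # placeholder bucket
        "LocalPath": "/opt/ml/checkpoints",  # where the container writes checkpoints
    },
}
# MaxWaitTimeInSeconds must be >= MaxRuntimeInSeconds, or the request is rejected.
```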
⚠️ Exam Trap: Auto scaling has a cooldown period—after scaling up, it waits before scaling again. If a question describes rapid, spiky traffic and the model can't scale fast enough, the fix might be increasing the minimum instance count (warm pool) rather than adjusting the scaling policy. Also, scaling to zero instances is only possible with serverless endpoints—standard auto scaling has a minimum of 1 instance.
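Raising the minimum instance count, as the trap above suggests, is done on the scalable target itself rather than on any policy. A sketch, with the capacities as assumptions:

```python
# Sketch: registering the endpoint variant as a scalable target with a
# raised minimum, so a floor of instances stays warm for sudden spikes.
# These kwargs would be passed to register_scalable_target() on the
# application-autoscaling client.
def scalable_target(endpoint_name, min_capacity=2, max_capacity=10):
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/AllTraffic",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,  # capacity kept warm; must be >= 1
        "MaxCapacity": max_capacity,
    }
```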
Reflection Question: An endpoint receives 50 requests/second during business hours and 2 requests/second at night. Traffic doubles during promotional events (unpredictable). What combination of scaling policies would you configure?