Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

4.2.1. Auto Scaling SageMaker Endpoints

Auto scaling for SageMaker endpoints uses target tracking as the default policy: you set a target such as InvocationsPerInstance = 70, and SageMaker adjusts capacity automatically. The critical nuance is cooldown periods: the scale-out cooldown (default 300s) prevents rapid instance additions during traffic spikes, while the scale-in cooldown (default 600s) prevents premature removal during brief traffic lulls. The exam frequently tests the interplay between scaling and VPC configuration: an endpoint in a private VPC needs an interface endpoint for CloudWatch (to publish the metrics that drive scaling) and a gateway endpoint for S3 (to load model artifacts). Missing either dependency causes silent failures that are hard to diagnose.
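A minimal sketch of wiring this up through the Application Auto Scaling API (the endpoint/variant names and capacity bounds below are illustrative placeholders; the client is passed in so the sketch stays importable without AWS credentials):

```python
# Placeholder resource ID -- substitute your own endpoint and variant names.
RESOURCE_ID = "endpoint/my-endpoint/variant/AllTraffic"

# Keep InvocationsPerInstance near 70, with the default cooldowns
# (300s scale-out, 600s scale-in) made explicit.
TARGET_TRACKING_CONFIG = {
    "TargetValue": 70.0,
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
    },
    "ScaleOutCooldown": 300,
    "ScaleInCooldown": 600,
}


def attach_policy(client, resource_id=RESOURCE_ID):
    """Register the variant as a scalable target, then attach the policy.

    `client` is a boto3 "application-autoscaling" client, e.g.
    boto3.client("application-autoscaling").
    """
    client.register_scalable_target(
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        MinCapacity=1,   # illustrative bounds
        MaxCapacity=4,
    )
    client.put_scaling_policy(
        PolicyName="invocations-target-tracking",
        ServiceNamespace="sagemaker",
        ResourceId=resource_id,
        ScalableDimension="sagemaker:variant:DesiredInstanceCount",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration=TARGET_TRACKING_CONFIG,
    )
```

Note that the longer scale-in cooldown mirrors the asymmetry described above: adding capacity is cheap to undo, while removing it too eagerly risks dropped requests.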

💡 First Principle: Auto scaling adjusts the number of instances behind an endpoint based on demand, ensuring you have enough capacity during peaks without paying for idle instances during troughs. The exam tests which scaling metric and policy to use for ML-specific workloads.

| Scaling Policy | How It Works | Best For |
| --- | --- | --- |
| Target tracking | Maintains a metric at a target value | Steady scaling (e.g., keep invocations/instance at 100) |
| Step scaling | Adds/removes capacity in steps based on alarm thresholds | Rapid response to traffic spikes |
| Scheduled scaling | Pre-configures capacity at specific times | Predictable traffic patterns (business hours) |
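For the scheduled-scaling row, a hedged sketch of the parameters you would pass to Application Auto Scaling's `put_scheduled_action` (action names, cron times, and capacities are illustrative placeholders):

```python
def business_hours_actions(resource_id="endpoint/my-endpoint/variant/AllTraffic"):
    """Build two scheduled actions: raise the capacity floor before
    business hours, lower it again at night. Pass each dict to a boto3
    "application-autoscaling" client's put_scheduled_action().
    """
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    scale_up = {
        **common,
        "ScheduledActionName": "business-hours-up",
        "Schedule": "cron(0 8 ? * MON-FRI *)",   # 08:00 UTC, weekdays
        "ScalableTargetAction": {"MinCapacity": 3, "MaxCapacity": 10},
    }
    scale_down = {
        **common,
        "ScheduledActionName": "business-hours-down",
        "Schedule": "cron(0 20 ? * MON-FRI *)",  # 20:00 UTC, weekdays
        "ScalableTargetAction": {"MinCapacity": 1, "MaxCapacity": 4},
    }
    return [scale_up, scale_down]
```

Scheduled actions adjust the min/max bounds, so a target tracking policy can still fine-tune capacity within the raised floor during the day.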
Key metrics for auto scaling ML endpoints:
| Metric | What It Measures | Scale When |
| --- | --- | --- |
| InvocationsPerInstance | Request load per instance | Exceeds target (e.g., >100/instance) |
| ModelLatency | Time to run inference | Exceeds SLA threshold |
| CPUUtilization | Compute usage | Consistently >70% |
| GPUUtilization | GPU usage | Consistently >80% |
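Only InvocationsPerInstance has a predefined metric type; scaling on ModelLatency requires a customized metric specification. A sketch of that configuration (the 250 ms SLA and cooldowns are illustrative assumptions, not recommendations):

```python
def latency_policy_config(endpoint_name, variant_name, target_us=250_000):
    """Target tracking on ModelLatency via a customized metric spec.

    SageMaker publishes ModelLatency in microseconds, so 250_000 is
    roughly a 250 ms average-latency target. Pass the returned dict as
    TargetTrackingScalingPolicyConfiguration in put_scaling_policy().
    """
    return {
        "TargetValue": float(target_us),
        "CustomizedMetricSpecification": {
            "MetricName": "ModelLatency",
            "Namespace": "AWS/SageMaker",
            "Dimensions": [
                {"Name": "EndpointName", "Value": endpoint_name},
                {"Name": "VariantName", "Value": variant_name},
            ],
            "Statistic": "Average",
        },
        # Scale out quickly when latency breaches the target; scale in
        # conservatively to avoid oscillation.
        "ScaleOutCooldown": 120,
        "ScaleInCooldown": 600,
    }
```

Watch the units here: quoting an SLA in milliseconds but configuring the target in milliseconds instead of microseconds is an easy way to make the policy scale far too late.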

Spot Instances behind endpoints aren't supported for real-time inference (Spot interruptions would drop requests). However, Spot Instances work well for training jobs and batch transform where interruptions can be handled via checkpointing.

⚠️ Exam Trap: Auto scaling has a cooldown period: after scaling out, it waits before scaling again. If a question describes rapid, spiky traffic and the endpoint can't scale fast enough, the fix is often raising the minimum instance count (a warm pool of ready capacity) rather than tuning the scaling policy. Also, scaling to zero instances is only possible with serverless endpoints; standard auto scaling has a minimum of 1 instance.
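The warm-pool fix amounts to re-registering the scalable target with a higher floor. A sketch (capacities are illustrative; the optional client is a boto3 "application-autoscaling" client):

```python
def raise_floor(resource_id, min_capacity=3, max_capacity=10, client=None):
    """Build (and optionally apply) a register_scalable_target call that
    raises MinCapacity, keeping warm instances ready for spikes that
    cooldown-throttled reactive scaling can't absorb in time.
    """
    params = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": resource_id,
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }
    if client is not None:
        client.register_scalable_target(**params)
    return params
```

The trade-off is explicit: you pay for the idle floor instances at night in exchange for headroom during spikes.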

Reflection Question: An endpoint receives 50 requests/second during business hours and 2 requests/second at night. Traffic doubles during promotional events (unpredictable). What combination of scaling policies would you configure?

Written by Alvin Varughese, Founder (15 professional certifications)