1.4.1. Cost vs. Performance vs. Latency
💡 First Principle: Cost, performance, and latency form a triangle where improving one often worsens another. The exam tests whether you can identify which vertex the scenario prioritizes and choose the AWS service that optimizes for it.
Here's how these trade-offs manifest in real exam scenarios:
Latency-optimized: A real-time fraud detection system needs sub-100ms predictions. You'd choose a real-time SageMaker endpoint with provisioned compute, a GPU instance if the model is large, and pre-loaded model artifacts. This is the most expensive option but meets the latency constraint.
Cost-optimized: A weekly customer segmentation job processes millions of records. You'd choose SageMaker Batch Transform or a SageMaker Processing job with Spot Instances. No persistent endpoint needed—pay only when the job runs.
Performance-optimized: Training a large language model requires maximum throughput. You'd choose multi-GPU instances (like ml.p4d.24xlarge), distributed training with data parallelism, and SageMaker's managed training infrastructure to handle the complexity.
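The cost-optimized scenario is worth quantifying. Here's a minimal sketch with made-up rates (not actual AWS pricing — the hourly price and Spot discount are illustrative assumptions) comparing a weekly two-hour batch job on Spot against keeping an endpoint running all week:

```python
# Hypothetical rates, for illustration only -- not actual AWS pricing.
ON_DEMAND_PER_HOUR = 1.00   # assumed instance price
SPOT_PER_HOUR = 0.30        # assumed Spot price (Spot often runs well below On-Demand)
JOB_HOURS_PER_WEEK = 2      # the weekly segmentation job's runtime
HOURS_PER_WEEK = 168

# Persistent endpoint: billed every hour, whether or not it serves traffic.
endpoint_weekly = HOURS_PER_WEEK * ON_DEMAND_PER_HOUR

# Batch Transform on Spot: pay only while the job runs.
batch_weekly = JOB_HOURS_PER_WEEK * SPOT_PER_HOUR

print(endpoint_weekly)                        # 168.0
print(batch_weekly)                           # 0.6
print(round(endpoint_weekly / batch_weekly))  # 280
```

Even with rough numbers, the shape of the answer is clear: for a job that runs a couple of hours per week, a persistent endpoint costs orders of magnitude more than paying only for job runtime.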
| Scenario | Optimize For | AWS Choice | Why Not The Alternative |
|---|---|---|---|
| Fraud detection (real-time) | Latency | Real-time endpoint + GPU | Serverless has cold starts |
| Weekly batch scoring | Cost | Batch Transform + Spot | Persistent endpoint wastes money |
| Large model training | Performance | Multi-GPU + distributed training | Single instance too slow |
| Intermittent inference (<1/min) | Cost | Serverless endpoint | Provisioned endpoint bills 24/7 for mostly idle capacity |
| Video analysis pipeline | Latency + Performance | Async endpoint + GPU | Real-time caps payloads at ~6 MB |
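The table above can be condensed into a small decision helper. This is a study aid, not a deployment rule — the function name and thresholds are illustrative simplifications, except the ~6 MB real-time payload cap, which is a documented SageMaker limit:

```python
def choose_endpoint(latency_sensitive: bool,
                    requests_per_minute: float,
                    payload_mb: float,
                    batch_job: bool) -> str:
    """Map scenario attributes to a SageMaker deployment option.

    Simplified study heuristic mirroring the table above; thresholds
    other than the ~6 MB real-time payload cap are illustrative.
    """
    if batch_job:
        return "Batch Transform + Spot"   # pay only while the job runs
    if payload_mb > 6:
        return "Asynchronous endpoint"    # queues large payloads, GPU-backed if needed
    if latency_sensitive and requests_per_minute >= 1:
        return "Real-time endpoint"       # provisioned, always warm, no cold starts
    return "Serverless endpoint"          # scales to zero between infrequent requests

# Mirror the table rows:
print(choose_endpoint(True, 600, 0.01, False))   # Real-time endpoint (fraud detection)
print(choose_endpoint(False, 0, 1, True))        # Batch Transform + Spot (weekly scoring)
print(choose_endpoint(True, 5, 50, False))       # Asynchronous endpoint (video analysis)
print(choose_endpoint(False, 0.5, 0.01, False))  # Serverless endpoint (intermittent)
```

Notice the order of the checks: batch vs. online is decided first, then payload size, then traffic pattern — which matches how exam scenarios usually reveal the answer.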
⚠️ Exam Trap: "Serverless" doesn't always mean "cheapest." Serverless SageMaker endpoints add cold-start latency and cap memory at 6 GB, which limits model size. If the scenario describes consistent, high-frequency traffic, a provisioned real-time endpoint can be both faster and cheaper per request than serverless. Read the traffic pattern carefully.
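The trap above is just break-even arithmetic. A minimal sketch, using made-up prices (not real AWS rates — both constants are illustrative assumptions) purely to show where the crossover sits:

```python
# Hypothetical prices, for illustration only -- not actual AWS pricing.
PROVISIONED_PER_HOUR = 0.20      # instance billed whether or not it serves traffic
SERVERLESS_PER_REQUEST = 0.0001  # in reality billed by duration x memory; flattened here

def cheaper_option(requests_per_hour: float) -> str:
    """Return which endpoint type costs less at a given traffic level."""
    serverless_cost = requests_per_hour * SERVERLESS_PER_REQUEST
    return "serverless" if serverless_cost < PROVISIONED_PER_HOUR else "provisioned"

# Break-even traffic level: where per-request charges equal the hourly instance price.
break_even = PROVISIONED_PER_HOUR / SERVERLESS_PER_REQUEST
print(round(break_even))        # 2000 requests/hour
print(cheaper_option(10))       # serverless  (low, intermittent traffic)
print(cheaper_option(100_000))  # provisioned (consistent, high-frequency traffic)
```

Below the break-even rate, serverless wins; above it, the always-on instance is cheaper per request — exactly the distinction the exam scenario's traffic description is testing.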
Reflection Question: A startup processes 10 requests per hour during the day and zero at night. They need sub-second latency. What endpoint type balances cost and latency?