4.1.1. Endpoint Types: Real-Time, Serverless, Async, and Batch
The endpoint selection decision tree is one of the most frequently tested patterns on the exam. Start with the latency requirement: if sub-second response is needed, it's real-time. If the response can wait minutes, async handles payloads up to 1 GB. If there's no request at all — just a batch of data to score — Batch Transform processes it as a job. Multi-model endpoints shine when you have hundreds or thousands of small models (think per-customer personalization) that would be prohibitively expensive to host individually. Serverless endpoints are ideal for sporadic traffic with tolerance for cold-start latency — they scale to zero when idle, eliminating cost during quiet periods. The exam often describes traffic patterns and asks you to pick the optimal endpoint type, so learning to match traffic shape to endpoint economics is critical.
💡 First Principle: SageMaker offers four endpoint types, each optimized for a different combination of latency requirement, payload size, traffic pattern, and cost tolerance. The exam gives you these four signals and expects you to pick the right endpoint.
| Endpoint Type | Latency | Payload Limit | Traffic Pattern | Cost Model |
|---|---|---|---|---|
| Real-time | Milliseconds | 6 MB | Steady, high volume | Pay for always-on instances |
| Serverless | Seconds (cold start) | 4 MB | Sporadic, low/variable | Pay per inference |
| Asynchronous | Minutes | 1 GB | Large payloads, tolerant of delay | Pay for processing time |
| Batch Transform | Hours | S3 dataset | No real-time need, bulk scoring | Pay per job (no persistent endpoint) |
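The decision tree described above can be sketched as a small helper function. This is purely illustrative study code, not an AWS API; the function and parameter names are made up, and the thresholds come from the table.

```python
def choose_endpoint(latency_ok_seconds: float, payload_mb: float,
                    traffic: str) -> str:
    """Map the exam's signals to a SageMaker endpoint type.

    traffic: 'steady', 'sporadic', or 'batch' (illustrative labels).
    """
    if traffic == "batch":
        return "Batch Transform"   # bulk scoring job, no persistent endpoint
    if payload_mb > 6:
        return "Asynchronous"      # queued requests, payloads up to 1 GB
    if latency_ok_seconds < 1:
        return "Real-time"         # always-on instances, millisecond latency
    if traffic == "sporadic":
        return "Serverless"        # scales to zero, tolerates cold starts
    return "Real-time"

# Scenarios mirroring the table rows:
print(choose_endpoint(0.1, 2, "steady"))     # steady sub-second -> Real-time
print(choose_endpoint(60, 500, "steady"))    # huge payload -> Asynchronous
print(choose_endpoint(5, 2, "sporadic"))     # spiky, latency-tolerant -> Serverless
print(choose_endpoint(3600, 1000, "batch"))  # offline dataset -> Batch Transform
```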
Multi-model endpoints host multiple models on a single endpoint, dynamically loading the requested model into memory. They are cost-effective when you have many models that each receive low individual traffic. The trade-off is latency: cold-loading a model from S3 takes seconds.
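With a multi-model endpoint, the caller selects the model per request via the `TargetModel` parameter of `invoke_endpoint` in the SageMaker Runtime API. The sketch below only assembles the request parameters; the endpoint name and artifact key are hypothetical, and the actual call is left commented out since it requires live AWS credentials.

```python
# Sketch only: builds invoke_endpoint kwargs for a multi-model endpoint
# without calling AWS. Endpoint name and artifact key are hypothetical.
def build_mme_request(endpoint_name: str, model_artifact: str,
                      payload: bytes) -> dict:
    """Kwargs for sagemaker-runtime invoke_endpoint; TargetModel names
    the artifact (under the endpoint's S3 model prefix) to serve."""
    return {
        "EndpointName": endpoint_name,
        "TargetModel": model_artifact,  # e.g. one per-customer model
        "ContentType": "application/json",
        "Body": payload,
    }

kwargs = build_mme_request("personalization-mme", "customer-1234.tar.gz",
                           b'{"features": [0.2, 0.7]}')
# With credentials configured, you would then run:
# boto3.client("sagemaker-runtime").invoke_endpoint(**kwargs)
```

If `customer-1234.tar.gz` is not already in memory, this request pays the seconds-long cold-load penalty described above.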
Multi-container endpoints run multiple containers in sequence (serial inference pipeline) or independently. Use serial pipelines for pre-processing → inference → post-processing chains.
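A serial pipeline is expressed at model-creation time: the SageMaker `CreateModel` API accepts a `Containers` list, and `InferenceExecutionConfig.Mode` chooses `"Serial"` (containers chained per request) or `"Direct"` (each container invokable independently). The sketch below builds the parameters only; image URIs, the role ARN, bucket, and model name are placeholders.

```python
# Sketch of CreateModel parameters for a serial inference pipeline.
# All names, ARNs, image URIs, and S3 paths below are placeholders.
pipeline_model = {
    "ModelName": "preprocess-infer-postprocess",
    "ExecutionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "Containers": [  # executed in order for every request in Serial mode
        {"Image": "<account>.dkr.ecr.<region>.amazonaws.com/preprocess:latest"},
        {"Image": "<account>.dkr.ecr.<region>.amazonaws.com/inference:latest",
         "ModelDataUrl": "s3://my-bucket/model.tar.gz"},
        {"Image": "<account>.dkr.ecr.<region>.amazonaws.com/postprocess:latest"},
    ],
    # "Direct" instead would let clients target each container independently.
    "InferenceExecutionConfig": {"Mode": "Serial"},
}
# boto3.client("sagemaker").create_model(**pipeline_model)
```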
⚠️ Exam Trap: Serverless endpoints have cold starts—the first request after inactivity adds several seconds of latency. If a question mentions "consistent sub-second latency" and "variable traffic," the answer is a real-time endpoint with auto scaling, not serverless. Serverless is only correct when some latency variability is acceptable.
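What makes an endpoint serverless is its endpoint configuration: the production variant carries a `ServerlessConfig` (memory size and max concurrency) instead of an instance type. A hedged sketch, with placeholder names and the API call commented out:

```python
# Sketch: endpoint-config production variant for a serverless endpoint.
# Variant and model names are placeholders. Note there is no InstanceType:
# capacity is defined by memory and concurrency, and nothing is billed
# while the endpoint sits idle at zero.
serverless_variant = {
    "VariantName": "AllTraffic",
    "ModelName": "sporadic-traffic-model",
    "ServerlessConfig": {
        "MemorySizeInMB": 2048,  # 1024-6144, in 1 GB increments
        "MaxConcurrency": 5,     # concurrent invocations before throttling
    },
}
# boto3.client("sagemaker").create_endpoint_config(
#     EndpointConfigName="serverless-config",
#     ProductionVariants=[serverless_variant])
```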
Reflection Question: A medical imaging company processes CT scans (500 MB each) for tumor detection. Results are needed within 15 minutes. Which endpoint type is appropriate, and why are the other three wrong?