Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

4.1.1. Endpoint Types: Real-Time, Serverless, Async, and Batch

The endpoint selection decision tree is one of the most frequently tested patterns on the exam. Start with the latency requirement: if sub-second response is needed, it's real-time. If the response can wait minutes, async handles payloads up to 1 GB. If there's no request at all — just a batch of data to score — Batch Transform processes it as a job. Multi-model endpoints shine when you have hundreds or thousands of small models (think per-customer personalization) that would be prohibitively expensive to host individually. Serverless endpoints are ideal for sporadic traffic with tolerance for cold-start latency — they scale to zero when idle, eliminating cost during quiet periods. The exam often describes traffic patterns and asks you to pick the optimal endpoint type, so learning to match traffic shape to endpoint economics is critical.

💡 First Principle: SageMaker offers four endpoint types, each optimized for a different combination of latency requirement, payload size, traffic pattern, and cost tolerance. The exam gives you these four signals and expects you to pick the right endpoint.
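The decision tree over these four signals can be sketched as a toy function. The signal names and thresholds below are illustrative teaching aids, not official SageMaker logic:

```python
def choose_endpoint_type(online_request: bool, max_latency_s: float,
                         payload_mb: float, traffic: str) -> str:
    """Toy decision tree over the four exam signals (illustrative only)."""
    if not online_request:
        return "batch-transform"   # no request/response at all: score a dataset as a job
    if payload_mb > 6 or max_latency_s >= 60:
        return "asynchronous"      # payloads up to 1 GB, minutes of delay tolerated
    if traffic == "sporadic" and max_latency_s > 1:
        return "serverless"        # scale-to-zero; cold starts add seconds
    return "real-time"             # sub-second latency, steady high volume
```

For example, a steady stream of small requests needing sub-second answers maps to `"real-time"`, while a job scoring an entire S3 dataset with no live request maps to `"batch-transform"`.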

| Endpoint Type | Latency | Payload Limit | Traffic Pattern | Cost Model |
|---|---|---|---|---|
| Real-time | Milliseconds | 6 MB | Steady, high volume | Pay for always-on instances |
| Serverless | Seconds (cold start) | 4 MB | Sporadic, low/variable | Pay per inference |
| Asynchronous | Minutes | 1 GB | Large payloads, tolerant of delay | Pay for processing time |
| Batch Transform | Hours | S3 dataset | No real-time need, bulk scoring | Pay per job (no persistent endpoint) |
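Serverless and asynchronous hosting are both expressed through the endpoint configuration. The sketch below only builds the request bodies for `create_endpoint_config`; the model name, bucket, and instance sizing are hypothetical placeholders:

```python
# Request bodies for sagemaker.create_endpoint_config (constructed, not sent).
# "churn-model" and the S3 bucket are hypothetical placeholders.
serverless_config = {
    "EndpointConfigName": "churn-serverless",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "churn-model",
        "ServerlessConfig": {
            "MemorySizeInMB": 2048,   # memory allocated per concurrent invocation
            "MaxConcurrency": 5,      # concurrent invocations before throttling
        },
    }],
}

async_config = {
    "EndpointConfigName": "churn-async",
    "ProductionVariants": [{
        "VariantName": "AllTraffic",
        "ModelName": "churn-model",
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
    }],
    "AsyncInferenceConfig": {
        # Async results are written to S3 rather than returned inline
        "OutputConfig": {"S3OutputPath": "s3://example-bucket/async-results/"},
    },
}
# boto3.client("sagemaker").create_endpoint_config(**serverless_config)
```

Note that the serverless variant specifies no instance type at all, which is exactly why it can scale to zero, while the async variant keeps instances but decouples the response through S3.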

Multi-model endpoints host multiple models on a single endpoint, dynamically loading the requested model into memory. Cost-effective when you have many models with low individual traffic. The trade-off is latency: cold loading a model takes seconds.
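Invoking a multi-model endpoint looks like a normal invocation plus a `TargetModel` parameter naming which artifact to load. A minimal sketch, with the endpoint and artifact names as hypothetical placeholders:

```python
# Arguments for sagemaker-runtime's invoke_endpoint against a multi-model
# endpoint; the endpoint name and model artifact are hypothetical.
invoke_args = {
    "EndpointName": "personalization-mme",
    "TargetModel": "customer-1234.tar.gz",   # loaded into memory on first request
    "ContentType": "application/json",
    "Body": b'{"features": [0.2, 0.7, 0.1]}',
}
# boto3.client("sagemaker-runtime").invoke_endpoint(**invoke_args)
```

The first call for a given `TargetModel` pays the seconds-long cold-load cost described above; subsequent calls hit the cached copy.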

Multi-container endpoints run multiple containers in sequence (serial inference pipeline) or independently. Use serial pipelines for pre-processing → inference → post-processing chains.
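A serial inference pipeline is defined as a single SageMaker model whose `Containers` list executes in order. The ECR image URIs and role ARN below are hypothetical placeholders:

```python
# Request body for sagemaker.create_model defining a serial inference
# pipeline: containers run in list order. All ARNs/URIs are placeholders.
pipeline_model = {
    "ModelName": "preprocess-predict-postprocess",
    "ExecutionRoleArn": "arn:aws:iam::123456789012:role/SageMakerRole",
    "Containers": [
        {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/preprocess:latest"},
        {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgb-inference:latest"},
        {"Image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/postprocess:latest"},
    ],
}
# boto3.client("sagemaker").create_model(**pipeline_model)
```

Each container's output becomes the next container's input, so the feature-engineering logic ships with the model instead of living in the client.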

⚠️ Exam Trap: Serverless endpoints have cold starts—the first request after inactivity adds several seconds of latency. If a question mentions "consistent sub-second latency" and "variable traffic," the answer is a real-time endpoint with auto scaling, not serverless. Serverless is only correct when some latency variability is acceptable.
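The "real-time endpoint with auto scaling" answer is implemented with Application Auto Scaling target tracking on the variant's instance count. A sketch with hypothetical names and thresholds, building the requests without sending them:

```python
# Application Auto Scaling requests for a real-time endpoint variant.
# Endpoint/variant names and target values are hypothetical.
resource_id = "endpoint/churn-endpoint/variant/AllTraffic"

scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "MinCapacity": 2,    # warm floor: latency stays consistent, unlike serverless
    "MaxCapacity": 10,
}

scaling_policy = {
    "PolicyName": "invocations-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 700.0,  # invocations per instance per minute to hold
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
        },
    },
}
# client = boto3.client("application-autoscaling")
# client.register_scalable_target(**scalable_target)
# client.put_scaling_policy(**scaling_policy)
```

Because `MinCapacity` is nonzero, there is always warm capacity serving requests, which is the property serverless cannot guarantee.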

Reflection Question: A medical imaging company processes CT scans (500 MB each) for tumor detection. Results are needed within 15 minutes. Which endpoint type is appropriate, and why are the other three wrong?

Written by Alvin Varughese, Founder (15 professional certifications)