4.2.4. VPC Configuration for SageMaker Endpoints
💡 First Principle: By default, SageMaker training jobs and endpoints have internet access. For production workloads with sensitive data, you should run them inside a VPC with private subnets—isolating them from the public internet. This adds security but requires proper VPC endpoint configuration so SageMaker can still reach AWS services (S3, CloudWatch, ECR).
Network isolation is an additional SageMaker setting that blocks all outbound network calls from the container: no internet and no other AWS services, including S3. SageMaker itself copies the training data from the S3 input channels into the container and uploads the model artifacts afterward, so the container's code only ever sees local files. This is the most restrictive setting and is used for highly sensitive workloads where even VPC endpoint access to other services is prohibited.
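To make the two settings concrete, here is a minimal sketch of the network-related fields of a `CreateTrainingJob` request. `VpcConfig` and `EnableNetworkIsolation` are parameters of that API; the helper name `network_settings` and the subnet and security group IDs are hypothetical placeholders, and the rest of the request (algorithm, role, data channels) is omitted.

```python
def network_settings(subnet_ids, security_group_ids, isolate=False):
    """Build only the network-related keys of a CreateTrainingJob request.

    Hypothetical helper for illustration; it makes no AWS calls.
    """
    return {
        # VPC mode: the job's network interfaces are placed in these subnets.
        "VpcConfig": {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        },
        # Network isolation: the container cannot make outbound calls at all.
        "EnableNetworkIsolation": isolate,
    }


# Placeholder IDs, not real resources.
subnets = ["subnet-aaaa1111", "subnet-bbbb2222"]
groups = ["sg-cccc3333"]

vpc_only = network_settings(subnets, groups)                 # VPC mode
locked_down = network_settings(subnets, groups, isolate=True)  # VPC mode + isolation
```

In practice these keys would be merged into the full request passed to the boto3 SageMaker client's `create_training_job`. The point of the sketch is that the two settings are independent switches: one chooses where the job runs, the other removes the container's ability to call out.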
Required VPC endpoints for SageMaker in private subnets:
| VPC Endpoint | Type | Purpose | What Breaks Without It |
|---|---|---|---|
| S3 | Gateway | Read/write training data and model artifacts | Training job can't access data |
| SageMaker API | Interface | API calls (CreateTrainingJob, etc.) | Can't launch or manage jobs |
| SageMaker Runtime | Interface | Invoke endpoints for inference | Endpoint invocations fail |
| CloudWatch Logs | Interface | Push training/inference logs | No visibility into job output |
| ECR | Interface (api + dkr) | Pull container images | "Unable to pull image" errors |
| STS | Interface | Assume IAM roles | Role-based access fails silently |
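The table can be turned into a quick checklist. The sketch below compares the endpoint service names present in a VPC against the set listed above, assuming the usual `com.amazonaws.<region>.<service>` naming pattern. The function name `missing_endpoints` and the sample `existing` list are hypothetical; in a real account the list could be taken from the `ServiceName` field returned by EC2's `DescribeVpcEndpoints`.

```python
# Service suffixes from the table above, with their endpoint type.
REQUIRED_SUFFIXES = {
    "s3": "Gateway",
    "sagemaker.api": "Interface",
    "sagemaker.runtime": "Interface",
    "logs": "Interface",      # CloudWatch Logs
    "ecr.api": "Interface",
    "ecr.dkr": "Interface",
    "sts": "Interface",
}


def missing_endpoints(region, existing_service_names):
    """Return the required endpoint service names not present in the VPC."""
    required = {f"com.amazonaws.{region}.{suffix}" for suffix in REQUIRED_SUFFIXES}
    return sorted(required - set(existing_service_names))


# Hypothetical VPC that has everything except the two ECR endpoints.
existing = [
    "com.amazonaws.us-east-1.s3",
    "com.amazonaws.us-east-1.sagemaker.api",
    "com.amazonaws.us-east-1.sagemaker.runtime",
    "com.amazonaws.us-east-1.logs",
    "com.amazonaws.us-east-1.sts",
]

missing = missing_endpoints("us-east-1", existing)
print(missing)
```

A VPC in this state would be expected to show the "Unable to pull image" symptom from the table, since both ECR endpoints are absent.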
The cost of VPC endpoints is a practical consideration the exam may reference. Interface endpoints (PrivateLink) cost ~$0.01/hour per AZ plus data processing charges. For a SageMaker deployment requiring five interface endpoints across two AZs, that's 5 × 2 × $0.01 × ~730 hours ≈ $73/month in endpoint costs alone, which is acceptable for production but potentially wasteful for development environments. Counting the two ECR endpoints separately, the table above lists six interface endpoints, which comes to roughly $88/month at the same rate. Gateway endpoints (S3, DynamoDB) are free.
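The monthly estimate above can be reproduced with a few lines of arithmetic. This is a rough sketch: the $0.01/hour rate is the approximate figure quoted in the text, 730 hours per month is an assumed average, and data processing charges are left out.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def interface_endpoint_monthly_cost(num_endpoints, num_azs, hourly_rate=0.01):
    """Estimate fixed monthly cost of interface endpoints (excludes data processing)."""
    return num_endpoints * num_azs * hourly_rate * HOURS_PER_MONTH


cost = interface_endpoint_monthly_cost(num_endpoints=5, num_azs=2)
print(f"Estimated endpoint cost: ${cost:.2f}/month")  # about $73
```

Because the charge scales with endpoints × AZs, dropping a development VPC to a single AZ or sharing endpoints across environments roughly halves or removes this fixed cost.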
Security groups for SageMaker in VPC mode must allow outbound traffic on port 443 (HTTPS) to the VPC endpoints, and the security groups attached to the interface endpoints must allow inbound traffic on port 443 from the SageMaker security group. You should also specify at least two subnets in different AZs for high availability. A common deployment mistake is configuring a single subnet, which leaves endpoints without failover capability if that AZ has an outage.
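Both checks in this paragraph are easy to express as a pre-deployment sanity test. The sketch below uses simplified, hypothetical data shapes (a subnet-to-AZ mapping and a list of egress rule dicts) rather than the actual EC2 API response format, and the function names are illustrative only.

```python
def spans_multiple_azs(subnet_to_az):
    """True if the configured subnets cover at least two distinct AZs."""
    return len(set(subnet_to_az.values())) >= 2


def allows_https_egress(egress_rules):
    """True if any egress rule permits TCP 443 (or all traffic, protocol "-1")."""
    for rule in egress_rules:
        if rule["protocol"] == "-1":
            return True
        if rule["protocol"] == "tcp" and rule["from_port"] <= 443 <= rule["to_port"]:
            return True
    return False


# Hypothetical configurations.
single_subnet = {"subnet-aaaa1111": "us-east-1a"}
two_subnets = {"subnet-aaaa1111": "us-east-1a", "subnet-bbbb2222": "us-east-1b"}
https_only = [{"protocol": "tcp", "from_port": 443, "to_port": 443}]
ssh_only = [{"protocol": "tcp", "from_port": 22, "to_port": 22}]

print(spans_multiple_azs(single_subnet), spans_multiple_azs(two_subnets))
print(allows_https_egress(https_only), allows_https_egress(ssh_only))
```

Note that two subnets in the same AZ would still fail the first check, which matches the intent: the goal is AZ diversity, not subnet count.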
⚠️ Exam Trap: "VPC mode" and "network isolation" are different. VPC mode runs the job inside your VPC, where it can still reach AWS services through VPC endpoints. Network isolation blocks all outbound network access from the container. If a question says "no internet access but needs to write logs to CloudWatch," the answer is VPC mode with VPC endpoints, not network isolation (under which the container cannot call CloudWatch or any other service itself).
Reflection Question: A training job in VPC mode fails with "Unable to pull container image from ECR." What VPC configuration is likely missing?