Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

4.2.4. VPC Configuration for SageMaker Endpoints

💡 First Principle: By default, SageMaker training jobs and endpoints have internet access. For production workloads with sensitive data, you should run them inside a VPC with private subnets—isolating them from the public internet. This adds security but requires proper VPC endpoint configuration so SageMaker can still reach AWS services (S3, CloudWatch, ECR).

Network isolation is an additional SageMaker setting that blocks all network calls made from inside the container: no internet, no other AWS services. The container can only see the data SageMaker stages for it; SageMaker itself still downloads the input channels from S3 and uploads the model artifacts on the container's behalf. This is the most restrictive setting and is used for highly sensitive workloads where even VPC endpoint access to other services is prohibited.
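The two settings are independent request parameters, which a short sketch can make concrete. The example below assumes the request shape used by the boto3 SageMaker `create_training_job` call (`VpcConfig` with `Subnets` and `SecurityGroupIds`, plus the boolean `EnableNetworkIsolation`). The helper name `network_settings` and all resource IDs are hypothetical placeholders, and the code only builds a dictionary; it does not call AWS.

```python
from __future__ import annotations

from typing import Any


def network_settings(
    subnet_ids: list[str],
    security_group_ids: list[str],
    network_isolation: bool = False,
) -> dict[str, Any]:
    """Build the network-related keys of a training job request (hypothetical helper)."""
    settings: dict[str, Any] = {"EnableNetworkIsolation": network_isolation}
    # VPC mode is switched on by supplying subnets and security groups.
    if subnet_ids:
        settings["VpcConfig"] = {
            "Subnets": list(subnet_ids),
            "SecurityGroupIds": list(security_group_ids),
        }
    return settings


# VPC mode only: the job runs in private subnets and reaches AWS services via VPC endpoints.
vpc_only = network_settings(["subnet-aaa", "subnet-bbb"], ["sg-123"])

# VPC mode plus network isolation: the container itself makes no outbound calls.
isolated = network_settings(["subnet-aaa", "subnet-bbb"], ["sg-123"], network_isolation=True)
```

In a real job these keys would be merged with the rest of the request (role, algorithm specification, input and output configuration) before being passed to the API.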

Required VPC endpoints for SageMaker in private subnets:
| VPC Endpoint | Type | Purpose | What Breaks Without It |
| --- | --- | --- | --- |
| S3 | Gateway | Read/write training data and model artifacts | Training job can't access data |
| SageMaker API | Interface | API calls (CreateTrainingJob, etc.) | Can't launch or manage jobs |
| SageMaker Runtime | Interface | Invoke endpoints for inference | Endpoint invocations fail |
| CloudWatch Logs | Interface | Push training/inference logs | No visibility into job output |
| ECR | Interface (api + dkr) | Pull container images | "Unable to pull image" errors |
| STS | Interface | Assume IAM roles | Role-based access fails silently |
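The endpoint list above can be turned into a quick gap check. This is a minimal sketch, assuming the standard `com.amazonaws.<region>.<service>` naming for VPC endpoint services; the function names are hypothetical, and the list of existing endpoints is passed in by hand rather than fetched from AWS.

```python
from __future__ import annotations

# Service suffixes for the endpoints listed above, with their endpoint type.
REQUIRED_SERVICES = {
    "s3": "Gateway",
    "sagemaker.api": "Interface",
    "sagemaker.runtime": "Interface",
    "logs": "Interface",      # CloudWatch Logs
    "ecr.api": "Interface",
    "ecr.dkr": "Interface",
    "sts": "Interface",
}


def required_service_names(region: str) -> set[str]:
    """Full endpoint service names expected in the given region."""
    return {f"com.amazonaws.{region}.{suffix}" for suffix in REQUIRED_SERVICES}


def missing_endpoints(region: str, existing: list[str]) -> list[str]:
    """Return required service names that have no endpoint yet, sorted."""
    return sorted(required_service_names(region) - set(existing))


# Example: a VPC that has every endpoint except the two ECR ones.
existing = [
    "com.amazonaws.us-east-1.s3",
    "com.amazonaws.us-east-1.sagemaker.api",
    "com.amazonaws.us-east-1.sagemaker.runtime",
    "com.amazonaws.us-east-1.logs",
    "com.amazonaws.us-east-1.sts",
]
gaps = missing_endpoints("us-east-1", existing)
print(gaps)
```

In practice the `existing` list could come from the EC2 `describe_vpc_endpoints` call, reading the `ServiceName` field of each returned endpoint.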

The cost of VPC endpoints is a practical consideration the exam may reference. Interface endpoints (PrivateLink) cost ~$0.01/hour per AZ plus data processing charges. For a SageMaker deployment requiring five interface endpoints across two AZs, that's 5 × 2 × $0.01 × ~730 hours ≈ $73/month in endpoint costs alone, before data processing. That is acceptable for production but potentially wasteful for development environments. Gateway endpoints (S3, DynamoDB) are free.
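The monthly figure is simple arithmetic, sketched below so it can be rerun with other endpoint counts. The ~$0.01/hour rate is the approximate figure quoted above, the 730 hours per month is an assumed average, and data processing charges are deliberately left out; the function name is hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_interface_endpoint_cost(
    endpoints: int,
    azs: int,
    hourly_rate_usd: float = 0.01,  # approximate per-AZ rate quoted in the text
) -> float:
    """Fixed monthly cost of interface endpoints, excluding data processing."""
    return endpoints * azs * hourly_rate_usd * HOURS_PER_MONTH


# Five interface endpoints across two AZs, as in the example above (roughly 73 USD).
production = monthly_interface_endpoint_cost(endpoints=5, azs=2)
print(round(production, 2))

# Gateway endpoints (S3, DynamoDB) add nothing to this figure.
```

Counting the ECR `api` and `dkr` endpoints separately gives six interface endpoints for the list in this section, so the same function can be called with `endpoints=6` for that case.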

Security groups for SageMaker in VPC mode must allow outbound traffic on port 443 (HTTPS) to the VPC endpoints, and the security groups attached to the interface endpoints must allow inbound traffic on port 443 from the SageMaker resources. For hosted endpoints, configure at least two subnets in different AZs for high availability. A common deployment mistake is configuring a single subnet, which leaves the endpoint without failover if that AZ has problems.
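Both checks in the paragraph above can be expressed as a small pre-deployment validation. This is a sketch only: the rule dictionaries mirror the `IpProtocol`, `FromPort`, and `ToPort` fields of the egress entries returned by EC2 `describe_security_groups`, the function names and sample IDs are hypothetical, and rule destinations (CIDR ranges, prefix lists) are not inspected.

```python
from __future__ import annotations


def allows_https_egress(egress_rules: list[dict]) -> bool:
    """True if any egress rule permits outbound TCP traffic on port 443."""
    for rule in egress_rules:
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "tcp" and rule.get("FromPort", 0) <= 443 <= rule.get("ToPort", 65535):
            return True
    return False


def spans_multiple_azs(subnet_to_az: dict[str, str]) -> bool:
    """True if the configured subnets cover at least two Availability Zones."""
    return len(set(subnet_to_az.values())) >= 2


# Example inputs with placeholder IDs.
rules = [{"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443}]
single_az = {"subnet-aaa": "us-east-1a"}
two_azs = {"subnet-aaa": "us-east-1a", "subnet-bbb": "us-east-1b"}

print(allows_https_egress(rules))     # True
print(spans_multiple_azs(single_az))  # False: the single-subnet mistake
print(spans_multiple_azs(two_azs))    # True
```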

⚠️ Exam Trap: "VPC mode" and "network isolation" are different. VPC mode runs the job inside your VPC (can still access services via VPC endpoints). Network isolation blocks all outbound network access. If a question says "no internet access but needs to write logs to CloudWatch," the answer is VPC mode with VPC endpoints—not network isolation (which would block CloudWatch access too).

Reflection Question: A training job in VPC mode fails with "Unable to pull container image from ECR." What VPC configuration is likely missing?

Written by Alvin Varughese
Founder, 15 professional certifications