5.1.2. VPC Security, Endpoints, and Credential Management
💡 First Principle: A VPC (Virtual Private Cloud) is a network perimeter that isolates your data resources. Data stores inside a VPC (RDS, Redshift, EMR) are inaccessible from the public internet by default. To access these resources from Glue, Lambda, or other services, you must configure VPC endpoints, security groups, and network routes, creating controlled pathways instead of open doors.
VPC endpoints provide private connectivity between your VPC and AWS services without traversing the internet. Gateway endpoints carry no additional charge and support only S3 and DynamoDB; they work by adding a route to your route tables. Interface endpoints (powered by PrivateLink, billed per hour plus per GB processed) support most other services (Glue, Kinesis, KMS, Secrets Manager) and appear as elastic network interfaces in your subnets. For data engineering, VPC endpoints ensure that data in transit between services never leaves the AWS network.
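To make the gateway versus interface distinction concrete, here is a minimal sketch of the arguments each type takes in boto3's `ec2.create_vpc_endpoint` call. The helper names, the region, and every resource ID are hypothetical placeholders. The functions only build request arguments, so nothing is created until the result is passed to a real EC2 client.

```python
def gateway_endpoint_args(vpc_id, region, service, route_table_ids):
    """Build create_vpc_endpoint arguments for a gateway endpoint.

    Gateway endpoints attach to route tables and exist only for S3 and DynamoDB.
    """
    if service not in ("s3", "dynamodb"):
        raise ValueError("gateway endpoints support only s3 and dynamodb")
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "RouteTableIds": list(route_table_ids),
    }


def interface_endpoint_args(vpc_id, region, service, subnet_ids, security_group_ids):
    """Build create_vpc_endpoint arguments for an interface (PrivateLink) endpoint.

    Interface endpoints are placed in subnets and guarded by security groups.
    """
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.{service}",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Lets the default service hostname resolve to the endpoint's private IPs.
        "PrivateDnsEnabled": True,
    }


# Hypothetical IDs, for illustration only.
s3_args = gateway_endpoint_args("vpc-0abc", "us-east-1", "s3", ["rtb-0abc"])
glue_args = interface_endpoint_args(
    "vpc-0abc", "us-east-1", "glue", ["subnet-0abc"], ["sg-0abc"]
)
print(s3_args["ServiceName"], glue_args["ServiceName"])

# To create the endpoints for real (requires boto3 and AWS credentials):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   ec2.create_vpc_endpoint(**s3_args)
#   ec2.create_vpc_endpoint(**glue_args)
```

The gateway call takes route tables while the interface call takes subnets and security groups, which mirrors how each type is wired into the VPC.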
Security groups act as stateful firewalls for individual resources. A Redshift cluster's security group defines which IP ranges and other security groups can connect. A Glue job running in a VPC needs a security group that allows outbound traffic to S3 (via gateway endpoint) and the Glue Data Catalog (via interface endpoint).
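As an illustration of the rules described above, the sketch below builds the `IpPermissions` entries that boto3's `ec2.authorize_security_group_ingress` accepts: one that admits another security group (for example, the group attached to a Glue connection) and one that admits a CIDR range. The helper names and all IDs are hypothetical; 5439 is Redshift's default port.

```python
def ingress_from_security_group(port, source_sg_id):
    """Allow TCP on one port from any resource that belongs to source_sg_id."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "UserIdGroupPairs": [{"GroupId": source_sg_id}],
    }


def ingress_from_cidr(port, cidr):
    """Allow TCP on one port from an IPv4 CIDR range."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr}],
    }


REDSHIFT_PORT = 5439  # Redshift's default listener port

# Hypothetical IDs: the Glue job's security group and an internal address range.
rules = [
    ingress_from_security_group(REDSHIFT_PORT, "sg-0glue"),
    ingress_from_cidr(REDSHIFT_PORT, "10.0.0.0/16"),
]
print(rules)

# To apply the rules to the Redshift cluster's security group
# (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("ec2").authorize_security_group_ingress(
#       GroupId="sg-0redshift", IpPermissions=rules
#   )
```

Because security groups are stateful, the reply traffic for a connection admitted by these inbound rules is allowed automatically; no matching outbound rule is needed on the Redshift side.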
AWS Secrets Manager stores and rotates secrets (database passwords, API keys, credentials). Automatic rotation changes passwords on a schedule without application downtime. For data pipelines, Secrets Manager is the answer when questions mention "rotate database credentials" or "manage secrets securely."
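A pipeline benefits from rotation only if it fetches the secret at run time instead of baking the password into its configuration. The following is a minimal sketch of that pattern using the `get_secret_value` call from boto3's Secrets Manager client. The secret name, the JSON keys, and the `FakeSecretsClient` stand-in are assumptions made so the example runs without AWS access.

```python
import json


def get_db_credentials(secrets_client, secret_id):
    """Fetch a secret and parse it as JSON.

    Assumes the secret was stored as a JSON string; a plain-text secret
    would need to be read from response["SecretString"] directly.
    """
    response = secrets_client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Stand-in for boto3.client("secretsmanager"), for offline illustration."""

    def __init__(self, secrets):
        self._secrets = secrets

    def get_secret_value(self, SecretId):
        return {"SecretString": self._secrets[SecretId]}


# Hypothetical secret name and contents.
client = FakeSecretsClient(
    {"prod/redshift/etl": json.dumps({"username": "etl_user", "password": "s3cr3t"})}
)
creds = get_db_credentials(client, "prod/redshift/etl")
print(creds["username"])

# With real AWS access, swap in the real client:
#   import boto3
#   creds = get_db_credentials(boto3.client("secretsmanager"), "prod/redshift/etl")
```

Calling the function at the start of each job run means a rotated password is picked up on the next run without a redeploy.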
AWS Systems Manager Parameter Store stores configuration values (connection strings, feature flags) and secure strings (encrypted parameters). It's simpler and cheaper than Secrets Manager but lacks automatic rotation. If a question mentions "store and retrieve configuration parameters," Parameter Store is the answer. If it mentions "automatic credential rotation," it's Secrets Manager.
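For comparison, reading from Parameter Store looks like the sketch below, which uses the `get_parameter` call from boto3's SSM client. The `WithDecryption` flag asks the service to return the decrypted value of a SecureString parameter. The parameter names and the `FakeSsmClient` stand-in are hypothetical, included so the example runs without AWS access.

```python
def get_config_value(ssm_client, name, decrypt=False):
    """Return the value of one Parameter Store parameter.

    Pass decrypt=True for SecureString parameters to get the plaintext value.
    """
    response = ssm_client.get_parameter(Name=name, WithDecryption=decrypt)
    return response["Parameter"]["Value"]


class FakeSsmClient:
    """Stand-in for boto3.client("ssm"), for offline illustration."""

    def __init__(self, parameters):
        self._parameters = parameters

    def get_parameter(self, Name, WithDecryption=False):
        return {"Parameter": {"Name": Name, "Value": self._parameters[Name]}}


# Hypothetical parameter names and values.
ssm = FakeSsmClient(
    {
        "/etl/redshift/host": "example-cluster.internal",
        "/etl/feature/dedupe": "true",
    }
)
host = get_config_value(ssm, "/etl/redshift/host")
dedupe_enabled = get_config_value(ssm, "/etl/feature/dedupe") == "true"
print(host, dedupe_enabled)

# With real AWS access, swap in the real client:
#   import boto3
#   host = get_config_value(boto3.client("ssm"), "/etl/redshift/host")
```

Nothing in this call path rotates the stored value, which is the practical difference from Secrets Manager that the paragraph above draws.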
SageMaker Unified Studio (v1.1) introduces domains, domain units, and projects for organizing access. Domains define an organizational boundary, domain units group related data assets, and projects provide workspace-level access control. This layers governance on top of IAM.
⚠️ Exam Trap: Glue jobs running inside a VPC lose internet access by default; they can't reach the Glue service endpoint or S3 without VPC endpoints. If a Glue job fails with connection timeout errors after being configured for VPC access, the fix is adding VPC endpoints for S3 (gateway) and Glue (interface), not opening internet access.
Reflection Question: A Lambda function needs to read from an RDS database inside a VPC and write results to S3. What VPC configuration is required, and what's the common pitfall?