2.3.4. Compliance, Encryption, and Data Protection
💡 First Principle: ML data often contains sensitive information—personally identifiable information (PII), protected health information (PHI), financial records—and regulations require you to protect it at every stage of the lifecycle. Encryption and access controls aren't optional add-ons; they're foundational requirements that must be designed into the data pipeline from the start.
Consider what happens when a healthcare company trains a model on patient records without proper data protection. Even if the model is excellent, a single audit failure could result in regulatory fines, legal liability, and loss of trust. The exam tests whether you know the AWS tools and practices that prevent this.
Encryption Strategy:
| Layer | What to Encrypt | AWS Service | Key Type |
|---|---|---|---|
| At rest (S3) | Training data, model artifacts | SSE-S3, SSE-KMS, SSE-C | S3 managed key (SSE-S3); AWS managed or customer managed KMS key (SSE-KMS); customer-provided key (SSE-C) |
| At rest (EBS) | Training instance volumes | EBS encryption with KMS | AWS managed or customer managed |
| In transit | Data moving between services | TLS 1.2+ (AWS SDKs and the CLI use HTTPS endpoints by default) | Certificate-based |
| In SageMaker | Training job data, notebook data | SageMaker KMS integration | Customer managed KMS key recommended |
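To make the S3, EBS, and SageMaker rows concrete, here is a minimal sketch of the encryption-related fields in a SageMaker `CreateTrainingJob` request. It only builds the request dictionary and does not call AWS. The job name, bucket, key ARN, instance type, and the helper name `encrypted_training_job_config` are placeholders chosen for illustration, and a real request also needs fields such as `AlgorithmSpecification`, `RoleArn`, `InputDataConfig`, and `StoppingCondition`.

```python
from typing import Any, Dict


def encrypted_training_job_config(
    job_name: str,
    output_s3_uri: str,
    kms_key_arn: str,
) -> Dict[str, Any]:
    """Build the encryption-related part of a CreateTrainingJob request.

    This is a partial request for illustration only.
    """
    if not output_s3_uri.startswith("s3://"):
        raise ValueError("output_s3_uri must be an s3:// URI")
    return {
        "TrainingJobName": job_name,
        # Model artifacts written to S3 are encrypted with this KMS key.
        "OutputDataConfig": {
            "S3OutputPath": output_s3_uri,
            "KmsKeyId": kms_key_arn,
        },
        # The storage volume attached to the training instance uses the same key.
        "ResourceConfig": {
            "InstanceType": "ml.m5.xlarge",  # placeholder instance type
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
            "VolumeKmsKeyId": kms_key_arn,
        },
    }


# Hypothetical names and ARN, used only for this example.
config = encrypted_training_job_config(
    job_name="phi-model-training",
    output_s3_uri="s3://example-ml-artifacts/output/",
    kms_key_arn="arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID",
)
```

Using one customer managed key for both the output location and the training volume keeps the example short; separate keys per data class are an equally valid design.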
Data Classification and Protection:
| Technique | Purpose | AWS Tool |
|---|---|---|
| PII detection | Find personal data in datasets | Amazon Comprehend, Amazon Macie |
| Data masking | Replace sensitive values with anonymized versions | Glue DataBrew, custom Lambda |
| Anonymization | Remove identifying information irreversibly | Custom transforms in Glue |
| Tokenization | Replace sensitive data with non-sensitive tokens | Custom solution with DynamoDB mapping |
| Data residency | Ensure data stays in specific regions | S3 bucket Region and bucket policies, SageMaker Region selection |
Amazon Macie automatically discovers and classifies sensitive data in S3 buckets using machine learning and pattern matching. It identifies PII, PHI, and financial data, and generates findings that integrate with Security Hub. For ML data pipelines, Macie acts as an automated scanner: run it against a dataset's bucket before training so that objects containing sensitive data are flagged before they enter the pipeline.
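The masking and tokenization rows in the table above can be illustrated with a small, self-contained sketch. The `TokenVault` class is a hypothetical in-memory stand-in for a mapping table (the table suggests DynamoDB for this role), and `mask_email` is one possible masking rule among many. Neither is an AWS API.

```python
import secrets
from typing import Dict


class TokenVault:
    """In-memory stand-in for a token mapping table (hypothetical)."""

    def __init__(self) -> None:
        self._value_to_token: Dict[str, str] = {}
        self._token_to_value: Dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        """Return a stable random token for a sensitive value."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        """Recover the original value; only authorized callers should do this."""
        return self._token_to_value[token]


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, _, domain = email.partition("@")
    if not local or not domain:
        return "***"
    return local[0] + "***@" + domain


vault = TokenVault()
record = {"patient_id": "P-1001", "email": "jane.doe@example.com", "age": 54}

# The training copy carries a token and a masked email instead of raw identifiers.
safe_record = {
    "patient_id": vault.tokenize(record["patient_id"]),
    "email": mask_email(record["email"]),
    "age": record["age"],
}
```

Tokenization is reversible through the vault, so the mapping table itself needs encryption and tight access control. Masking as written here is one-way, but the retained domain could still help identify a person, so it is weaker than full anonymization.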
Compliance Implications for ML:
- HIPAA (healthcare): PHI must be encrypted, access must be logged, and a Business Associate Addendum (BAA) must be in place with AWS
- GDPR (EU data): Right to erasure means you may need to retrain models when a user requests data deletion
- Data residency: Some regulations require data to stay in specific geographic regions—affects which AWS Region you train in
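The GDPR point above implies some form of training lineage: to act on an erasure request, you need to know which models were trained on that user's data. The sketch below assumes a simple mapping from model version to the set of user IDs in its training set. The mapping, the model names, and the function `models_needing_retraining` are hypothetical; a real pipeline would more likely keep this lineage in a metadata store.

```python
from typing import Dict, List, Set


def models_needing_retraining(
    training_lineage: Dict[str, Set[str]],
    erased_user_id: str,
) -> List[str]:
    """Return the models whose training data included the erased user."""
    return sorted(
        model
        for model, user_ids in training_lineage.items()
        if erased_user_id in user_ids
    )


# Hypothetical lineage: model version -> user IDs present in its training set.
lineage = {
    "churn-model-v1": {"user-17", "user-42"},
    "churn-model-v2": {"user-42", "user-99"},
    "fraud-model-v1": {"user-03"},
}

affected = models_needing_retraining(lineage, "user-42")
```

Whether a flagged model must actually be retrained is a legal and policy decision; the sketch only identifies candidates.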
⚠️ Exam Trap: Encryption at rest and in transit are separate configurations. A question might describe data encrypted in S3 (at rest) but ask about protecting it during transfer to a training instance. The answer involves TLS/SSL for in-transit encryption, not KMS. Also, SageMaker training jobs can be configured with inter-container traffic encryption for distributed training—a separate setting from storage encryption.
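To show that in-transit protection is configured separately from KMS, here is a minimal sketch of two in-transit settings. The first is a bucket policy that denies any S3 request not sent over TLS, using the `aws:SecureTransport` condition key. The second is the `EnableInterContainerTrafficEncryption` flag of a SageMaker `CreateTrainingJob` request. The bucket name and the helper `deny_insecure_transport_policy` are placeholders, and the code only builds the documents; it does not call AWS.

```python
import json
from typing import Any, Dict


def deny_insecure_transport_policy(bucket_name: str) -> Dict[str, Any]:
    """Build a bucket policy that rejects requests made without TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical bucket name, used only for this example.
policy = deny_insecure_transport_policy("example-training-data")
policy_json = json.dumps(policy, indent=2)

# Separate setting for distributed training: encrypt traffic between
# training containers. This is a field of the CreateTrainingJob request.
distributed_training_settings = {"EnableInterContainerTrafficEncryption": True}
```

Neither setting references a KMS key, which is the distinction the exam trap turns on: KMS keys govern data at rest, while these settings govern data in motion.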
Reflection Question: A healthcare company needs to train a model on patient records containing PHI. What encryption, access control, and compliance configurations must be in place before training can begin?