5.2.2. PII Detection and Privacy Preservation
💡 First Principle: PII that enters an FM's context window is processed and potentially referenced in outputs — not just stored and encrypted. Privacy preservation for GenAI requires detecting and acting on PII before it reaches the model, not just securing it at rest.
PII detection and handling pipeline:
Amazon Macie for S3 data classification: Before documents enter your RAG knowledge base, Macie scans S3 buckets to identify sensitive data patterns — SSNs, credit card numbers, passport numbers, medical record identifiers — and classifies the risk level:
```python
# Triggered by S3 upload events: quarantine documents with high-risk PII
# before they are ingested into the Knowledge Base.
def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Check Macie findings for this object before ingesting into the Knowledge Base
    findings = get_macie_findings_for_object(bucket, key)

    # Macie assigns each finding a severity of LOW, MEDIUM, or HIGH
    if any(f['severity'] == 'HIGH' for f in findings):
        # Document contains high-risk PII: quarantine it
        move_to_quarantine(bucket, key)
        notify_data_owner(key, findings)
        return  # Do not ingest into the Knowledge Base

    # Safe to ingest
    trigger_knowledge_base_sync(key)
```
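The `get_macie_findings_for_object` helper above is left abstract. A minimal sketch against the Macie2 API might look like the following (it assumes Macie has already published findings for the bucket, and normalizes each finding to the `severity` shape the Lambda checks):

```python
def normalize_findings(raw_findings):
    """Reduce full Macie findings to the fields the quarantine check needs."""
    # Macie reports severity as {'description': 'Low'|'Medium'|'High', 'score': ...}
    return [{'severity': f['severity']['description'].upper(), 'type': f['type']}
            for f in raw_findings]

def get_macie_findings_for_object(bucket, key):
    import boto3  # imported here so normalize_findings stays testable without AWS
    macie = boto3.client('macie2')
    # Filter findings down to the exact bucket/key of the uploaded object
    resp = macie.list_findings(findingCriteria={
        'criterion': {
            'resourcesAffected.s3Bucket.name': {'eq': [bucket]},
            'resourcesAffected.s3Object.key': {'eq': [key]},
        }
    })
    if not resp['findingIds']:
        return []
    details = macie.get_findings(findingIds=resp['findingIds'])
    return normalize_findings(details['findings'])
```

Note that Macie findings are published asynchronously, so in production you would typically trigger this Lambda from a Macie finding event (via EventBridge) rather than directly from the S3 upload.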
Anonymization strategies for sensitive data in RAG:
| Strategy | How It Works | When to Use |
|---|---|---|
| Tokenization | Replace PII with non-sensitive token (UUID) | When you need to de-reference later |
| Masking | Replace with asterisks or partial reveal | When display is needed but full value is not |
| Generalization | Replace specific value with range (age 35 → "30-40") | Analytics use cases |
| Pseudonymization | Replace with consistent fake value (John Smith → Person A) | When relationships need preserving |
| Differential privacy | Add statistical noise to prevent re-identification | ML training datasets |
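The first four strategies in the table are simple enough to sketch directly; the function names below are illustrative, and a real deployment would back the token vault with a secure store rather than an in-memory dict:

```python
import uuid

# Tokenization: swap PII for a random token, keeping a vault for later de-referencing
_token_vault = {}

def tokenize(value):
    token = str(uuid.uuid4())
    _token_vault[token] = value  # a secure vault (e.g. KMS-encrypted table) in practice
    return token

def detokenize(token):
    return _token_vault[token]

# Masking: partial reveal of the last few characters
def mask(value, visible=4):
    return '*' * (len(value) - visible) + value[-visible:]

# Generalization: collapse an exact age into a decade bucket
def generalize_age(age):
    low = (age // 10) * 10
    return f"{low}-{low + 10}"

# Pseudonymization: consistent fake value, so relationships between records survive
_pseudonyms = {}

def pseudonymize(name):
    if name not in _pseudonyms:
        _pseudonyms[name] = f"Person {chr(ord('A') + len(_pseudonyms))}"
    return _pseudonyms[name]
```

The key contrast to internalize for the exam: tokenization and pseudonymization are reversible or consistent (useful when you must link records back), while masking and generalization destroy information by design.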
⚠️ Exam Trap: Amazon Comprehend detects PII in text. Amazon Macie detects sensitive data patterns in S3 objects (files). They are complementary: Macie for batch scanning of stored documents, Comprehend for real-time PII detection in user inputs and FM outputs. Exam scenarios about "detecting PII in user queries before FM processing" point to Comprehend; scenarios about "finding sensitive data in S3 before ingesting into Knowledge Base" point to Macie.
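On the real-time side, Comprehend's `detect_pii_entities` API returns character offsets for each detected entity, which you can use to redact a user query before it reaches the FM. A minimal sketch (the redaction helper and threshold are illustrative):

```python
def redact_spans(text, entities, min_score=0.8):
    """Replace each detected PII span with its entity type, e.g. [SSN]."""
    # Splice right-to-left so earlier offsets remain valid after each replacement
    for e in sorted(entities, key=lambda e: e['BeginOffset'], reverse=True):
        if e['Score'] >= min_score:
            text = text[:e['BeginOffset']] + f"[{e['Type']}]" + text[e['EndOffset']:]
    return text

def redact_query(text):
    import boto3  # imported here so redact_spans stays testable without AWS
    comprehend = boto3.client('comprehend')
    resp = comprehend.detect_pii_entities(Text=text, LanguageCode='en')
    return redact_spans(text, resp['Entities'])
```

The redacted query (for example, "My SSN is [SSN]") is what enters the prompt, so the raw value never reaches the model's context window.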
Reflection Question: A healthcare company wants to use an FM to answer patient questions about their own health records. Patient records contain diagnoses (PHI). What is the privacy-preserving architecture that allows the FM to reference a patient's record in answering only that patient's questions, without any PHI being permanently stored in the vector index or retrievable by other patients?