Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

5.2.2. PII Detection and Privacy Preservation

💡 First Principle: PII that enters an FM's context window is processed and potentially referenced in outputs — not just stored and encrypted. Privacy preservation for GenAI requires detecting and acting on PII before it reaches the model, not just securing it at rest.

PII detection and handling pipeline:

Amazon Macie for S3 data classification: Before documents enter your RAG knowledge base, Macie scans S3 buckets to identify sensitive data patterns (SSNs, credit card numbers, passport numbers, medical record identifiers) and assigns each finding a severity of Low, Medium, or High:

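# Gate Knowledge Base ingestion on Macie findings for newly uploaded documents.
# get_macie_findings_for_object, move_to_quarantine, notify_data_owner, and
# trigger_knowledge_base_sync are helpers assumed to be defined elsewhere.
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record['s3']['object']['key'])

        # Check Macie findings for this object before ingesting into Knowledge Base
        # (findings exist only after a Macie job has analyzed the object)
        findings = get_macie_findings_for_object(bucket, key)

        # Macie reports finding severity as Low, Medium, or High
        if any(f['severity']['description'] == 'High' for f in findings):
            # Document contains high-risk PII: quarantine it
            move_to_quarantine(bucket, key)
            notify_data_owner(key, findings)
            continue  # Do not ingest into Knowledge Base

        # Safe to ingest
        trigger_knowledge_base_sync(key)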
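
The handler above calls get_macie_findings_for_object without defining it. One possible shape for that helper is sketched below, assuming the boto3 `macie2` client and its `list_findings` / `get_findings` calls. The optional `macie_client` parameter, the filter field names, and the batch size of 50 are assumptions made for illustration and should be checked against the Macie API reference. Findings are returned in Macie's raw shape, where the severity label sits under `severity.description`.

```python
from typing import Any, Optional


def get_macie_findings_for_object(
    bucket: str, key: str, macie_client: Optional[Any] = None
) -> list:
    """Return Macie findings whose affected resource is s3://bucket/key (sketch)."""
    if macie_client is None:
        import boto3  # imported lazily so the function can be tested with a stub

        macie_client = boto3.client("macie2")

    # Assumed filter fields: restrict findings to this bucket and object key.
    criteria = {
        "criterion": {
            "resourcesAffected.s3Bucket.name": {"eq": [bucket]},
            "resourcesAffected.s3Object.key": {"eq": [key]},
        }
    }

    # Collect all matching finding IDs, following pagination tokens.
    finding_ids = []
    next_token = None
    while True:
        kwargs = {"findingCriteria": criteria}
        if next_token:
            kwargs["nextToken"] = next_token
        page = macie_client.list_findings(**kwargs)
        finding_ids.extend(page.get("findingIds", []))
        next_token = page.get("nextToken")
        if not next_token:
            break

    # Fetch full finding details in batches (batch size is an assumption).
    findings = []
    for start in range(0, len(finding_ids), 50):
        batch = finding_ids[start:start + 50]
        response = macie_client.get_findings(findingIds=batch)
        findings.extend(response.get("findings", []))
    return findings
```

Because Macie analyzes objects through discovery jobs rather than at upload time, a lookup like this only returns results once a job has covered the object. A production pipeline would likely trigger ingestion from job completion or from Macie finding events rather than directly from the S3 upload event.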

Anonymization strategies for sensitive data in RAG:

| Strategy | How It Works | When to Use |
|---|---|---|
| Tokenization | Replace PII with non-sensitive token (UUID) | When you need to de-reference later |
| Masking | Replace with asterisks or partial reveal | When display is needed but full value is not |
| Generalization | Replace specific value with range (age 35 → "30-40") | Analytics use cases |
| Pseudonymization | Replace with consistent fake value (John Smith → Person A) | When relationships need preserving |
| Differential privacy | Add statistical noise to prevent re-identification | ML training datasets |
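
To make the strategies above concrete, here is a minimal Python sketch of each one using only the standard library. All names (`Tokenizer`, `mask`, `generalize_age`, `pseudonymize`, `laplace_count`) are hypothetical and chosen for illustration. The in-memory vault stands in for a secured token store, and the Laplace mechanism is used as one common way to add differential-privacy noise, not necessarily the one a given service applies.

```python
import hashlib
import hmac
import random
import uuid


class Tokenizer:
    """Tokenization: swap a value for a random UUID and keep a reversible lookup."""

    def __init__(self):
        self._token_to_value = {}  # in practice, a secured and access-controlled store
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        if value not in self._value_to_token:
            token = str(uuid.uuid4())
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]


def mask(value: str, visible: int = 4, fill: str = "*") -> str:
    """Masking: hide everything except the last `visible` characters."""
    hidden = max(len(value) - max(visible, 0), 0)
    return fill * hidden + value[hidden:]


def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: replace an exact age with the range that contains it."""
    low = (age // width) * width
    return f"{low}-{low + width}"


def pseudonymize(value: str, secret: bytes, prefix: str = "Person") -> str:
    """Pseudonymization: the same input always maps to the same fake label."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{prefix}-{digest[:8]}"


def laplace_count(true_count: float, epsilon: float,
                  sensitivity: float = 1.0, rng=None) -> float:
    """Differential privacy: add Laplace noise with scale sensitivity / epsilon."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponential draws is Laplace-distributed.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


if __name__ == "__main__":
    vault = Tokenizer()
    token = vault.tokenize("123-45-6789")
    print(token, "->", vault.detokenize(token))
    print(mask("123-45-6789"))
    print(generalize_age(35))
    print(pseudonymize("John Smith", secret=b"demo-key"))
    print(laplace_count(100, epsilon=1.0))
```

The keyed HMAC in `pseudonymize` keeps the mapping consistent across documents, which preserves relationships, while making it hard to reverse without the secret. Tokenization, by contrast, is deliberately reversible for callers who can reach the vault.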

⚠️ Exam Trap: Amazon Comprehend detects PII in text. Amazon Macie detects sensitive data patterns in S3 objects (files). They are complementary: Macie for batch scanning of stored documents, Comprehend for real-time PII detection in user inputs and FM outputs. Exam scenarios about "detecting PII in user queries before FM processing" point to Comprehend; scenarios about "finding sensitive data in S3 before ingesting into Knowledge Base" point to Macie.
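
For the real-time side of that split, the sketch below shows one way to redact PII from a user query before it reaches the FM, assuming the Comprehend `detect_pii_entities` operation and its documented response fields (`Entities`, `Type`, `Score`, `BeginOffset`, `EndOffset`). The function name `redact_pii`, the injected client parameter, and the 0.8 confidence threshold are illustrative assumptions rather than recommended values.

```python
from typing import Any


def redact_pii(text: str, comprehend_client: Any,
               min_score: float = 0.8, language_code: str = "en") -> str:
    """Replace each detected PII span with a [TYPE] placeholder (sketch)."""
    response = comprehend_client.detect_pii_entities(
        Text=text, LanguageCode=language_code
    )
    # Keep only detections at or above the confidence threshold.
    entities = [e for e in response["Entities"] if e["Score"] >= min_score]

    # Replace from the end of the string so earlier offsets stay valid.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        placeholder = f"[{entity['Type']}]"
        text = text[:entity["BeginOffset"]] + placeholder + text[entity["EndOffset"]:]
    return text
```

In a deployed pipeline the client would typically be created with `boto3.client("comprehend")` and the redacted text passed to the model in place of the raw query. The same function could be applied to FM outputs before they are returned to the user.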

Reflection Question: A healthcare company wants to use an FM to answer patient questions about their own health records. Patient records contain diagnoses (PHI). What is the privacy-preserving architecture that allows the FM to reference a patient's record in answering only that patient's questions, without any PHI being permanently stored in the vector index or retrievable by other patients?

Written by Alvin Varughese
Founder, 15 professional certifications