5.3.1. Compliance Frameworks and Data Lineage
💡 First Principle: Regulatory compliance for GenAI requires proving not just that outputs meet standards today, but also how any specific output was generated, what data informed it, and what controls were active at the time. That requires systematic data lineage and decision logging designed into the architecture from the start.
Model cards with SageMaker Model Cards:
import json
import boto3

sagemaker = boto3.client('sagemaker')

# Create a model card for a fine-tuned deployment.
# NOTE: Content must validate against the SageMaker model card JSON schema.
# The field names below are illustrative; check them against the schema
# before use, since fields the schema does not define may be rejected.
sagemaker.create_model_card(
    ModelCardName='customer-support-fm-v2',
    Content=json.dumps({
        "model_overview": {
            "model_id": "customer-support-fm-v2",
            "base_model": "anthropic.claude-3-haiku",
            "fine_tuning_dataset": "s3://my-bucket/training/support-conversations-2024/",
            "intended_use": "Customer support question answering for Acme Corp products",
            "out_of_scope_uses": ["Medical advice", "Legal guidance", "Financial advice"],
            "model_version": "2.0"
        },
        "training_details": {
            "training_data_description": "100K anonymized support conversations from 2022-2024",
            "preprocessing_steps": ["PII removal", "Quality filtering >3/5 rating"],
            "training_date": "2024-10-15"
        },
        "evaluation_details": {
            "metrics": {"accuracy": 0.87, "hallucination_rate": 0.02, "safety_score": 0.98},
            "evaluation_dataset": "s3://my-bucket/eval/golden-support-queries/",
            "known_limitations": ["May underperform on technical hardware questions",
                                  "Limited to English language"]
        },
        "responsible_ai": {
            "bias_evaluation": "Tested across customer demographics; no significant bias detected",
            "fairness_metrics": {"demographic_parity_diff": 0.03},
            "guardrails_applied": "arn:aws:bedrock:...:guardrail/GUARDRAILID"
        }
    }),
    # Valid statuses are 'Draft', 'PendingReview', 'Approved', and 'Archived'
    ModelCardStatus='Draft'
)
Data lineage with AWS Glue Data Catalog: Every document that enters the RAG knowledge base should have a lineage record tracking its provenance:
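from datetime import datetime, timezone
import boto3

glue = boto3.client('glue')

# Placeholder: set this to the ID of the knowledge base ingestion job being recorded
ingestion_job_id = 'INGESTION_JOB_ID'

# Register document lineage in Glue Data Catalog.
# The 'genai-data-catalog' database must already exist, and every value
# in Parameters must be a string.
glue.create_table(
    DatabaseName='genai-data-catalog',
    TableInput={
        'Name': 'knowledge-base-documents',
        'Description': 'Documents ingested into Bedrock Knowledge Base KBID12345',
        'Parameters': {
            'source_system': 'confluence-prod',
            'ingestion_job_id': ingestion_job_id,
            # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated)
            'ingestion_timestamp': datetime.now(timezone.utc).isoformat(),
            'data_classification': 'internal',
            'pii_scanned': 'true',
            'pii_findings': 'none',
            # Example ARN; AWS account IDs are 12 digits
            'knowledge_base_arn': 'arn:aws:bedrock:us-east-1:123456789012:knowledge-base/KBID12345'
        }
    }
)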
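Because the Glue `Parameters` field is a string-to-string map, it can help to build the lineage record in one place and coerce every value to a string before calling `create_table`. The following is a minimal sketch under that assumption; `build_lineage_parameters` is a hypothetical helper, not part of boto3, and the comma-joined format for PII findings is an illustrative choice.

```python
from datetime import datetime, timezone


def build_lineage_parameters(source_system, ingestion_job_id, knowledge_base_arn,
                             data_classification="internal", pii_findings=(), now=None):
    """Return a string-to-string map suitable for a Glue table's Parameters field.

    Hypothetical helper: the key names mirror the lineage record shown above.
    """
    # Allow the caller to inject a timestamp so the record is reproducible in tests.
    now = now or datetime.now(timezone.utc)
    return {
        "source_system": str(source_system),
        "ingestion_job_id": str(ingestion_job_id),
        "ingestion_timestamp": now.isoformat(),
        "data_classification": str(data_classification),
        "pii_scanned": "true",
        # Glue parameters cannot hold lists, so findings are joined into one string.
        "pii_findings": ",".join(pii_findings) if pii_findings else "none",
        "knowledge_base_arn": str(knowledge_base_arn),
    }
```

The returned dictionary can be passed as `TableInput['Parameters']`, which keeps the key names consistent across ingestion jobs.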
Metadata tagging for systematic attribution: Every AWS resource in the GenAI pipeline should carry standard tags for governance traceability:
GOVERNANCE_TAGS = [
    {'Key': 'data-classification', 'Value': 'internal'},
    {'Key': 'data-owner', 'Value': 'legal-team@company.com'},
    {'Key': 'compliance-framework', 'Value': 'sox-hipaa'},
    {'Key': 'pii-contains', 'Value': 'false'},
    {'Key': 'retention-policy', 'Value': '7-years'},
    {'Key': 'knowledge-base-id', 'Value': 'KBID12345'}
]
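AWS tagging APIs do not all accept the same shape: some take a list of `Key`/`Value` dictionaries (for example, the `TagSet` in S3 `put_bucket_tagging`), while others take a plain key-to-value map (for example, `TagsToAdd` in Glue `tag_resource`). The sketch below shows one way to convert the list form and to check that a resource carries every required governance tag. The helper names and the choice of required keys are assumptions for illustration.

```python
# Shortened copy of the governance tags so this sketch is self-contained.
EXAMPLE_TAGS = [
    {"Key": "data-classification", "Value": "internal"},
    {"Key": "data-owner", "Value": "legal-team@company.com"},
    {"Key": "retention-policy", "Value": "7-years"},
]

# Hypothetical policy: the tag keys every pipeline resource must carry.
REQUIRED_TAG_KEYS = {"data-classification", "data-owner", "retention-policy"}


def to_tag_map(tags):
    """Convert [{'Key': k, 'Value': v}, ...] into {k: v} for map-style tagging APIs."""
    return {tag["Key"]: tag["Value"] for tag in tags}


def missing_tag_keys(tags, required=REQUIRED_TAG_KEYS):
    """Return the required tag keys that are absent, sorted for stable output."""
    present = {tag["Key"] for tag in tags}
    return sorted(set(required) - present)
```

A check such as `missing_tag_keys` could run in a deployment pipeline so that untagged resources are caught before they reach production.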
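⚠️ Exam Trap: AWS Glue Data Catalog stores metadata about data assets (schema, location, and custom parameters such as lineage tags); it does not store the actual data. When an exam scenario describes "tracking which documents were used to answer a specific FM query," the answer is Bedrock model invocation logging, not Glue Data Catalog. Invocation logs capture the request sent to the model, including any retrieved context placed in the prompt, whereas Glue Data Catalog only tracks data asset metadata. Model invocation logging is disabled by default, so it must be turned on before the invocations you need to audit.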
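Model invocation logging is configured per account and Region through the Bedrock control-plane API. The sketch below builds a logging configuration and passes it to `put_model_invocation_logging_configuration`. The log group name, role ARN, and bucket name are placeholders you would supply, and the helper functions are hypothetical wrappers rather than part of boto3.

```python
def build_invocation_logging_config(log_group_name, role_arn, bucket_name,
                                    key_prefix="bedrock-invocations/"):
    """Build the loggingConfig payload for Bedrock model invocation logging.

    All arguments are placeholders supplied by the caller; the role must
    allow Bedrock to write to the CloudWatch log group.
    """
    return {
        "cloudWatchConfig": {"logGroupName": log_group_name, "roleArn": role_arn},
        "s3Config": {"bucketName": bucket_name, "keyPrefix": key_prefix},
        # Text delivery is what captures prompts (and any context inside them).
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }


def enable_invocation_logging(bedrock_client, logging_config):
    """Apply the configuration using a boto3 'bedrock' (control-plane) client."""
    bedrock_client.put_model_invocation_logging_configuration(
        loggingConfig=logging_config
    )


# Usage (requires AWS credentials; not executed here):
#   import boto3
#   config = build_invocation_logging_config(
#       "/bedrock/invocations", "arn:aws:iam::123456789012:role/ExampleRole", "my-audit-bucket")
#   enable_invocation_logging(boto3.client("bedrock"), config)
```

Logs are only written for invocations made after the configuration is applied, which is why this step belongs in the initial deployment rather than in an audit response.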
Reflection Question: A financial regulator requests an audit of all FM-generated investment recommendations made to clients over the past 12 months, including the exact data sources used to generate each recommendation. What combination of AWS services provides this audit trail, and what must have been enabled before the recommendations were generated?