6.3.2. Drift Detection and Continuous Evaluation
💡 First Principle: FM application quality drift occurs silently — model version updates, knowledge base staleness, prompt template changes, and input distribution shifts all degrade quality without raising errors. Catching drift requires scheduled evaluation jobs that compare current behavior against a fixed quality baseline.
Sources of quality drift:
| Drift Source | Frequency | Detection Method |
|---|---|---|
| FM model update | Unpredictable (provider-side) | Scheduled golden dataset eval after any model update |
| Knowledge base staleness | Weekly/monthly | Retrieval relevance score trending |
| Input distribution shift | Gradual over months | Compare current query embeddings to baseline cluster |
| Prompt regression | On every deployment | CI/CD eval gate (Phase 4 pattern) |
| Guardrails trigger rate shift | Gradual | Alert on trigger rate % change > threshold |
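The "input distribution shift" row above can be sketched as a centroid comparison: embed a baseline sample of queries at launch, then periodically compare the centroid of recent query embeddings against it. This is a minimal pure-Python sketch — in practice the embeddings would come from an embedding model (e.g. Titan Embeddings via Bedrock), and the 0.9 similarity threshold is an illustrative assumption to tune against your own traffic.

```python
import math

def centroid(vectors):
    """Mean vector of a list of equal-length embeddings."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def detect_embedding_drift(baseline_embeddings, current_embeddings, threshold=0.9):
    """Flag drift when the current query centroid diverges from the baseline centroid."""
    sim = cosine_similarity(centroid(baseline_embeddings),
                            centroid(current_embeddings))
    return sim < threshold, sim
```

Run weekly alongside the scheduled evaluation, this gives an early signal that users are asking about topics the golden dataset may not cover.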
Scheduled evaluation with EventBridge + Bedrock Model Evaluations:
EventBridge rule (runs the quality evaluation every Sunday at 2 a.m. UTC):

```json
{
  "schedule": "cron(0 2 ? * SUN *)",
  "target": {
    "arn": "arn:aws:lambda:::function:run-weekly-quality-eval",
    "input": {
      "evaluation_dataset": "s3://my-bucket/golden-dataset/eval-v3.jsonl",
      "quality_threshold": 0.85,
      "alert_topic": "arn:aws:sns:...:quality-drift-alerts"
    }
  }
}
```
```python
import os
from datetime import datetime

import boto3

bedrock = boto3.client('bedrock')
EVAL_ROLE_ARN = os.environ['EVAL_ROLE_ARN']

def run_evaluation(event, context):
    """Kick off a Bedrock automated model evaluation against the golden dataset."""
    eval_job = bedrock.create_evaluation_job(
        jobName=f"weekly-eval-{datetime.utcnow().strftime('%Y%m%d')}",
        evaluationConfig={
            'automated': {
                'datasetMetricConfigs': [{
                    'taskType': 'QuestionAndAnswer',
                    'dataset': {'name': 'golden-qa', 's3Uri': event['evaluation_dataset']},
                    'metricNames': ['Helpfulness', 'Faithfulness', 'Coherence']
                }]
            }
        },
        outputDataConfig={'s3Uri': 's3://my-bucket/eval-results/'},
        roleArn=EVAL_ROLE_ARN
    )
    # Evaluation jobs run asynchronously — check results via a second
    # Lambda scheduled ~2 hours after job start
```
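The results-checking Lambda's core decision — did any metric fall below the quality threshold? — can be kept as a pure function so it is testable without AWS calls. This is a hypothetical helper; it assumes each evaluation output record carries a metric name and a score, so verify the field names against the actual Bedrock evaluation output written to your results bucket.

```python
import json

def scores_below_threshold(result_lines, threshold):
    """Return the metrics whose average score falls below the quality threshold.

    Assumes each JSONL line looks like {"metricName": ..., "score": ...};
    check the real Bedrock evaluation output schema before relying on this.
    """
    totals, counts = {}, {}
    for line in result_lines:
        record = json.loads(line)
        name = record['metricName']
        totals[name] = totals.get(name, 0.0) + record['score']
        counts[name] = counts.get(name, 0) + 1
    return [name for name in totals if totals[name] / counts[name] < threshold]
```

The checker Lambda would read the results JSONL from S3, call this function with the `quality_threshold` from the EventBridge input, and publish to the SNS alert topic if the returned list is non-empty.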
Knowledge base freshness monitoring:
```python
import os
from datetime import datetime

import boto3

bedrock_agent = boto3.client('bedrock-agent')
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')
FRESHNESS_ALERT_TOPIC = os.environ['FRESHNESS_ALERT_TOPIC']

def check_knowledge_base_freshness(kb_id, max_days=7):
    """Alert if documents haven't been synced in more than max_days days."""
    kb_info = bedrock_agent.get_knowledge_base(knowledgeBaseId=kb_id)
    last_sync = kb_info['knowledgeBase']['updatedAt']
    days_since_sync = (datetime.utcnow() - last_sync.replace(tzinfo=None)).days
    # Emit the staleness metric so dashboards and alarms can trend it
    cloudwatch.put_metric_data(
        Namespace='GenAI/DataQuality',
        MetricData=[{
            'MetricName': 'KnowledgeBaseDaysSinceSync',
            'Value': days_since_sync,
            'Unit': 'Count',
            'Dimensions': [{'Name': 'KnowledgeBaseId', 'Value': kb_id}]
        }]
    )
    if days_since_sync > max_days:
        sns.publish(
            TopicArn=FRESHNESS_ALERT_TOPIC,
            Message=f"KB {kb_id} not synced in {days_since_sync} days"
        )
```
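The last drift source in the table — a shift in the guardrails trigger rate — follows the same pattern: compare a current rate against a recorded baseline and alert past a percent-change threshold. A minimal sketch of that comparison, with the 50% threshold as an illustrative assumption (in production you might instead use a CloudWatch alarm over the guardrail intervention metric):

```python
def trigger_rate_shift(baseline_rate, current_rate, max_pct_change=50.0):
    """Return (should_alert, pct_change) when the guardrail trigger rate
    moves more than max_pct_change percent from its baseline."""
    if baseline_rate == 0:
        # Any triggers at all are a shift from a zero baseline
        return current_rate > 0, float('inf') if current_rate > 0 else 0.0
    pct_change = abs(current_rate - baseline_rate) / baseline_rate * 100
    return pct_change > max_pct_change, pct_change
```

A rising trigger rate can mean users are probing the system; a falling one can mean a guardrail configuration silently loosened — both warrant investigation.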
⚠️ Exam Trap: Bedrock Model Evaluations runs on your golden dataset — the evaluation is only as good as the test cases you wrote. A golden dataset that doesn't cover edge cases, recent topics, or adversarial inputs will show consistently high scores while missing real-world quality problems. Treat golden dataset curation as an ongoing engineering responsibility, not a one-time setup task.
Reflection Question: Three months after launching your RAG knowledge base assistant, users notice it gives different answers to the same questions than it did at launch. No code changes have been deployed. What are the three most likely sources of behavior drift, and how would you determine which one is the actual cause?