Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.3.2. Drift Detection and Continuous Evaluation

💡 First Principle: FM application quality drift occurs silently — model version updates, knowledge base staleness, prompt template changes, and input distribution shifts all degrade quality without raising errors. Catching drift requires scheduled evaluation jobs that compare current behavior against a fixed quality baseline.

Sources of quality drift:
| Drift Source | Frequency | Detection Method |
| --- | --- | --- |
| FM model update | Unpredictable (provider-side) | Scheduled golden dataset eval after any model update |
| Knowledge base staleness | Weekly/monthly | Retrieval relevance score trending |
| Input distribution shift | Gradual over months | Compare current query embeddings to baseline cluster |
| Prompt regression | On every deployment | CI/CD eval gate (Phase 4 pattern) |
| Guardrails trigger rate shift | Gradual | Alert on trigger rate % change > threshold |
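The embedding-comparison method for input distribution shift can be sketched in plain Python. This is a minimal illustration using centroid cosine distance; in production the vectors would be real query embeddings (e.g., from Amazon Titan Embeddings), and the baseline centroid would be frozen at launch. All function names here are illustrative:

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def drift_score(baseline_embeddings, current_embeddings):
    """1 - cosine similarity between centroids; higher means more drift."""
    return 1 - cosine_similarity(centroid(baseline_embeddings),
                                 centroid(current_embeddings))

# Identical distributions score near 0; shifted query mixes score higher.
baseline = [[1.0, 0.0], [0.9, 0.1]]
shifted = [[0.1, 0.9], [0.0, 1.0]]
print(drift_score(baseline, baseline) < 0.01)   # True
print(drift_score(baseline, shifted) > 0.5)     # True
```

A scheduled job would compute this score over each week's queries and emit it as a CloudWatch metric, alerting when it crosses a chosen threshold.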
Scheduled evaluation with EventBridge + Bedrock Model Evaluations:
# EventBridge rule: run quality evaluation every Sunday at 2am UTC
# (put_targets delivers Input as a JSON string; shown expanded for readability)
{
    "ScheduleExpression": "cron(0 2 ? * SUN *)",
    "Targets": [{
        "Id": "weekly-quality-eval",
        "Arn": "arn:aws:lambda:::function:run-weekly-quality-eval",
        "Input": {
            "evaluation_dataset": "s3://my-bucket/golden-dataset/eval-v3.jsonl",
            "quality_threshold": 0.85,
            "alert_topic": "arn:aws:sns:...:quality-drift-alerts"
        }
    }]
}
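The rule and target can be created programmatically with the EventBridge (`events`) client. A sketch, assuming a Lambda target; the rule name, target ID, and helper function are illustrative choices:

```python
import json

def build_eval_input(dataset_uri, quality_threshold, topic_arn):
    """JSON string delivered to the Lambda as its event (Input must be a string)."""
    return json.dumps({
        'evaluation_dataset': dataset_uri,
        'quality_threshold': quality_threshold,
        'alert_topic': topic_arn,
    })

def create_weekly_eval_rule(lambda_arn, dataset_uri, topic_arn):
    """Create the schedule rule and attach the Lambda target."""
    import boto3  # imported here so the builder above stays dependency-free
    events = boto3.client('events')
    events.put_rule(
        Name='weekly-quality-eval',
        ScheduleExpression='cron(0 2 ? * SUN *)',  # Sundays 02:00 UTC
        State='ENABLED',
    )
    events.put_targets(
        Rule='weekly-quality-eval',
        Targets=[{
            'Id': 'weekly-quality-eval-lambda',
            'Arn': lambda_arn,
            'Input': build_eval_input(dataset_uri, 0.85, topic_arn),
        }],
    )
    # Note: the Lambda's resource policy must also allow events.amazonaws.com
    # to invoke it (lambda add-permission), omitted here.
```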

import boto3
from datetime import datetime

bedrock = boto3.client('bedrock')
EVAL_ROLE_ARN = '...'   # IAM role Bedrock assumes to read the dataset and write results
MODEL_ID = '...'        # identifier of the model whose outputs are evaluated

def run_evaluation(event, context):
    eval_job = bedrock.create_evaluation_job(
        jobName=f"weekly-eval-{datetime.utcnow().strftime('%Y%m%d')}",
        evaluationConfig={
            'automated': {
                'datasetMetricConfigs': [{
                    'taskType': 'QuestionAndAnswer',
                    'dataset': {'name': 'golden-qa', 's3Uri': event['evaluation_dataset']},
                    'metricNames': ['Helpfulness', 'Faithfulness', 'Coherence']
                }]
            }
        },
        # create_evaluation_job also requires inferenceConfig naming the model under test
        inferenceConfig={
            'models': [{'bedrockModel': {'modelIdentifier': MODEL_ID}}]
        },
        outputDataConfig={'s3Uri': 's3://my-bucket/eval-results/'},
        roleArn=EVAL_ROLE_ARN
    )
    # Check results via Lambda scheduled 2 hours after job start
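The follow-up check compares aggregated scores against the configured threshold and alerts on regression. A minimal sketch of that logic; the per-metric score-list format below is an assumption for illustration (the real job writes JSONL result files to the configured S3 output prefix, which the follow-up Lambda would parse first):

```python
def detect_regression(metric_scores, threshold):
    """Return the metrics whose average score fell below the threshold."""
    failing = {}
    for metric, scores in metric_scores.items():
        avg = sum(scores) / len(scores)
        if avg < threshold:
            failing[metric] = round(avg, 3)
    return failing

scores = {
    'Helpfulness': [0.9, 0.88, 0.91],
    'Faithfulness': [0.7, 0.65, 0.72],   # drifted below the 0.85 baseline
}
print(detect_regression(scores, 0.85))   # {'Faithfulness': 0.69}
```

Anything returned by `detect_regression` would be published to the SNS drift-alert topic from the EventBridge input.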
Knowledge base freshness monitoring:
import boto3
from datetime import datetime

bedrock_agent = boto3.client('bedrock-agent')
cloudwatch = boto3.client('cloudwatch')
sns = boto3.client('sns')
FRESHNESS_ALERT_TOPIC = 'arn:aws:sns:...:freshness-alerts'

def check_knowledge_base_freshness(kb_id):
    """Alert if documents haven't been synced in more than N days."""
    kb_info = bedrock_agent.get_knowledge_base(knowledgeBaseId=kb_id)
    # updatedAt reflects the last KB update; for per-data-source sync
    # times, inspect list_ingestion_jobs instead
    last_sync = kb_info['knowledgeBase']['updatedAt']
    days_since_sync = (datetime.utcnow() - last_sync.replace(tzinfo=None)).days
    
    cloudwatch.put_metric_data(
        Namespace='GenAI/DataQuality',
        MetricData=[{
            'MetricName': 'KnowledgeBaseDaysSinceSync',
            'Value': days_since_sync,
            'Unit': 'Count',
            'Dimensions': [{'Name': 'KnowledgeBaseId', 'Value': kb_id}]
        }]
    )
    
    if days_since_sync > 7:
        sns.publish(TopicArn=FRESHNESS_ALERT_TOPIC,
                    Message=f"KB {kb_id} not synced in {days_since_sync} days")
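The freshness metric can also drive a standing CloudWatch alarm instead of the inline SNS publish, so the alert fires even if the checking Lambda itself stops running. A sketch of the alarm arguments; the alarm name, daily period, and 7-day threshold are illustrative choices:

```python
def freshness_alarm_kwargs(kb_id, topic_arn, max_days=7):
    """Arguments for cloudwatch.put_metric_alarm on the freshness metric."""
    return {
        'AlarmName': f'kb-{kb_id}-stale',
        'Namespace': 'GenAI/DataQuality',
        'MetricName': 'KnowledgeBaseDaysSinceSync',
        'Dimensions': [{'Name': 'KnowledgeBaseId', 'Value': kb_id}],
        'Statistic': 'Maximum',
        'Period': 86400,              # evaluate once per day
        'EvaluationPeriods': 1,
        'Threshold': max_days,
        'ComparisonOperator': 'GreaterThanThreshold',
        'AlarmActions': [topic_arn],
    }

# cloudwatch.put_metric_alarm(**freshness_alarm_kwargs(kb_id, FRESHNESS_ALERT_TOPIC))
```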

⚠️ Exam Trap: Bedrock Model Evaluations runs on your golden dataset — the evaluation is only as good as the test cases you wrote. A golden dataset that doesn't cover edge cases, recent topics, or adversarial inputs will show consistently high scores while missing real-world quality problems. Treat golden dataset curation as an ongoing engineering responsibility, not a one-time setup task.

Reflection Question: Three months after launching your RAG knowledge base assistant, users notice it gives different answers to the same questions than it did at launch. No code changes have been deployed. What are the three most likely sources of behavior drift, and how would you determine which one is the actual cause?

Written by Alvin Varughese, Founder · 15 professional certifications