7.2.1. Diagnostic Methodology and Common Failure Patterns
💡 First Principle: The diagnostic sequence for FM application failures mirrors the data flow: start at the input, verify each transformation step produces valid intermediate output, and identify the first step where the intermediate output diverges from expected. Never skip to the final output and work backward.
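This front-to-back walk can be sketched as a generic pipeline bisection. The sketch below is illustrative only: the stage names, transforms, and validators are hypothetical stand-ins for your real pipeline steps (query rewriting, retrieval, prompt assembly, generation).

```python
def first_divergent_stage(user_input, stages):
    """Walk the pipeline front to back; return the name of the first stage
    whose intermediate output fails its validator, or None if all pass.

    `stages` is an ordered list of (name, transform, validate) triples.
    """
    value = user_input
    for name, transform, validate in stages:
        value = transform(value)
        if not validate(value):
            return name  # first point where output diverges from expected
    return None

# Toy usage: a two-stage "pipeline" in which retrieval returns nothing.
stages = [
    ("embed_query", lambda q: [0.1, 0.2], lambda vec: len(vec) > 0),
    ("retrieve", lambda vec: [], lambda chunks: len(chunks) > 0),
]
print(first_divergent_stage("What is our refund policy?", stages))  # -> retrieve
```

The point is the ordering: because each validator runs before the next transform, the first failing stage is found without ever reasoning backward from the final answer.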
Five-step diagnostic sequence:
1. Verify the input: confirm the user query (plus any rewriting or history truncation) produced the prompt you expect.
2. Verify retrieval: inspect the retrieved chunks and their relevance scores against the query.
3. Verify context assembly: confirm the key chunks survived truncation and the final prompt fits within the context window.
4. Verify generation: compare the model's response against the supplied context to confirm it is grounded.
5. Verify output handling: check guardrail traces and post-processing before concluding the model itself failed.
Common failure patterns and their signatures:
| Failure Pattern | Symptom | Root Cause | Diagnostic Evidence |
|---|---|---|---|
| Retrieval miss | FM says "I don't have information about X" but docs exist | Chunking too coarse; wrong embedding model; poor query | Context Precision < 0.5 in RAGAS |
| Context overflow | FM ignores key information in long contexts | Context window saturated; key info buried | Input token count near model limit |
| Hallucination | FM states facts not in retrieved context | Grounding too weak; model filling gaps | Faithfulness < 0.7; response claims not in chunks |
| Indirect injection | FM produces off-topic or adversarial responses | Malicious content in retrieved docs | Guardrails blocked at output; anomalous retrieved content |
| Guardrails over-triggering | Legitimate queries rejected | Topic denial too broad; word filter too aggressive | Guardrail trace shows false positive topic match |
| Stale knowledge | FM gives outdated information | Knowledge Base not synced; embedding model changed | KnowledgeBaseDaysSinceSync metric high |
| Conversation context loss | FM forgets earlier conversation turns | History too long; truncated by sliding window | Check conversation history length at query time |
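The table's diagnostic-evidence column lends itself to a simple triage helper. A minimal sketch follows; the metric field names (`context_precision`, `faithfulness`, and so on) and the staleness threshold of 7 days are assumptions, not part of any AWS API — only the 0.5 and 0.7 thresholds come from the table above.

```python
def triage(metrics):
    """Map observed evaluation metrics to candidate failure patterns from
    the table. Field names and the 7-day staleness cutoff are assumptions."""
    candidates = []
    if metrics.get("context_precision", 1.0) < 0.5:
        candidates.append("retrieval miss")
    if metrics.get("faithfulness", 1.0) < 0.7:
        candidates.append("hallucination")
    if metrics.get("input_tokens", 0) >= 0.95 * metrics.get("context_limit", float("inf")):
        candidates.append("context overflow")
    if metrics.get("kb_days_since_sync", 0) > 7:
        candidates.append("stale knowledge")
    return candidates

print(triage({"context_precision": 0.3, "faithfulness": 0.9}))  # -> ['retrieval miss']
```

A helper like this only narrows the search; confirming the root cause still requires inspecting the actual retrieved chunks and traces for the failing queries.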
Debugging retrieval failures with CloudWatch Logs Insights:
```python
# CloudWatch Logs Insights query to identify low-retrieval-score queries.
# Assumes the application logs query, top_retrieval_score, and kb_id as
# structured fields in the log group.
query = """
fields @timestamp, query, top_retrieval_score, kb_id
| filter top_retrieval_score < 0.5
| sort @timestamp desc
| limit 100
"""
# High volume of low-score retrievals → retrieval pipeline problem
# Specific query types failing → those topics may not be in the Knowledge Base
```
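To run such a query programmatically, CloudWatch Logs Insights uses an asynchronous start/poll pattern. A hedged sketch using boto3's `start_query` and `get_query_results` (the log group name is a placeholder; result rows come back as lists of `{field, value}` pairs that usually need flattening):

```python
import time

def run_insights_query(log_group, query, start_time, end_time):
    """Start a Logs Insights query, poll until it finishes, return rows as dicts."""
    import boto3  # imported lazily so the pure helper below works without boto3
    logs = boto3.client("logs")
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )["queryId"]
    while True:
        resp = logs.get_query_results(queryId=query_id)
        if resp["status"] in ("Complete", "Failed", "Cancelled"):
            break
        time.sleep(1)
    return [row_to_dict(row) for row in resp.get("results", [])]

def row_to_dict(row):
    """Flatten one Insights result row ([{field, value}, ...]) into a dict."""
    return {cell["field"]: cell["value"] for cell in row}
```

Note that `start_query` takes epoch-second timestamps, and queries against large time ranges can take tens of seconds to complete, hence the polling loop.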
Debugging Guardrails over-triggering:
```python
import json
from collections import defaultdict

import boto3

cloudwatch_logs = boto3.client("logs")

def analyze_guardrail_triggers(start_time, end_time):
    """Query invocation logs for Guardrails trigger analysis."""
    logs = cloudwatch_logs.filter_log_events(
        logGroupName='/aws/bedrock/model-invocations',
        startTime=start_time,
        endTime=end_time,
        filterPattern='{ $.stopReason = "guardrail_intervened" }',
    )
    trigger_reasons = defaultdict(int)
    for event in logs['events']:
        log_entry = json.loads(event['message'])
        policy = log_entry.get('guardrailTrace', {}).get('triggeredPolicy', 'unknown')
        trigger_reasons[policy] += 1
    # Identify top triggering policies; if topic denial is firing on legitimate
    # queries, the topic description may be too broad.
    return sorted(trigger_reasons.items(), key=lambda x: x[1], reverse=True)
```
⚠️ Exam Trap: Bedrock Model Invocation Logs show the actual prompt sent to the model and the actual response received — but only if logging was enabled before the problematic invocations occurred. Retroactive logging enablement cannot reconstruct past invocations. Enable invocation logging from day one, before the first production deployment.
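Enabling invocation logging is a one-time, account-level control-plane call. A sketch of building the payload for boto3's `put_model_invocation_logging_configuration` follows; the log group name and role ARN are placeholders, and the delivery flags shown (text on, image and embedding off) are an example choice, not a recommendation:

```python
def build_logging_config(log_group_name, role_arn):
    """Build the loggingConfig payload for Bedrock model invocation logging.
    The delivery flags control which payload types are captured."""
    return {
        "cloudWatchConfig": {
            "logGroupName": log_group_name,
            "roleArn": role_arn,
        },
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }

# One-time setup (the role must allow Bedrock to write to the log group):
# import boto3
# boto3.client("bedrock").put_model_invocation_logging_configuration(
#     loggingConfig=build_logging_config(
#         "/aws/bedrock/model-invocations",
#         "arn:aws:iam::123456789012:role/BedrockLoggingRole",
#     )
# )
```

This is exactly the setting the exam trap above hinges on: logging starts capturing prompts and responses only from the moment this call succeeds.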
Reflection Question: Users report that your RAG bot correctly answers "What is our refund policy?" but consistently fails on "Can I return a gift I received?" — giving a generic response despite a clear gift return policy in your knowledge base. Using the five-step diagnostic sequence, what would you check at each step, and which step is most likely the failure point?