7.2.3. Quality Troubleshooting Runbook
💡 First Principle: Quality issues in FM applications require a systematic runbook, not ad-hoc investigation, because the failure signatures overlap — hallucination, retrieval failure, and prompt injection can all manifest as "wrong answer." Only systematic elimination through the diagnostic sequence determines the true root cause.
Quality issue runbook:
Issue: FM produces confident incorrect answers (hallucination)
1. Check Bedrock model invocation logs — was retrieved context present in the request?
→ If NO context in prompt: retrieval pipeline failed to return results
→ If context present: proceed to step 2
2. Compare response claims against retrieved chunks — are claims supported?
→ Use RAGAS Faithfulness score on the specific failing query
→ If unsupported claims: grounding prompt too weak → strengthen system prompt
3. Check Guardrails grounding check score
→ If grounding check disabled or threshold too low → enable/raise threshold
4. Check the relevance scores of the retrieved chunks (returned in the Knowledge Base retrieval response)
→ Low relevance scores + hallucination → retrieval precision problem, not an FM problem
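Step 2's claim-support comparison can be approximated locally before running a full RAGAS evaluation. A minimal sketch of a crude lexical-overlap heuristic — this is *not* RAGAS Faithfulness, and the `unsupported_claims` helper and 0.5 threshold are illustrative assumptions:

```python
import re

def unsupported_claims(answer: str, chunks: list[str], min_overlap: float = 0.5) -> list[str]:
    """Flag answer sentences whose content words are mostly absent from retrieved chunks.

    Crude lexical stand-in for a faithfulness check -- use RAGAS for real scoring.
    """
    corpus_words = set(re.findall(r"[a-z0-9]+", " ".join(chunks).lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        # Only count content-bearing words (longer than 3 characters)
        words = [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in corpus_words for w in words) / len(words)
        if overlap < min_overlap:
            flagged.append(sentence)
    return flagged
```

Sentences flagged by this heuristic are candidates for unsupported claims; if they recur across queries, strengthen the grounding instructions in the system prompt (step 2 → step 3).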
Issue: FM refuses legitimate queries
1. Check Guardrails trace in invocation logs — which policy triggered?
→ Topic denial: topic description may be too broad
→ Content filter: the query may contain a term that triggers a filter category
→ Word filter: custom word list may have false positives
2. Test the query against Guardrails in isolation using ApplyGuardrail API
3. Refine policy: narrow topic description or adjust filter thresholds
4. If no Guardrails trigger: check system prompt — overly restrictive instructions
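Step 2's isolation test can be scripted against the ApplyGuardrail API. A minimal sketch assuming boto3 credentials are configured; the guardrail ID, version, and query in the usage comment are placeholders, and `which_policy_triggered` is an illustrative helper over the response's `action`/`assessments` fields:

```python
def build_apply_guardrail_request(guardrail_id: str, version: str, query: str) -> dict:
    """Request kwargs for bedrock-runtime ApplyGuardrail, testing a user query in isolation."""
    return {
        "guardrailIdentifier": guardrail_id,
        "guardrailVersion": version,
        "source": "INPUT",  # evaluate as user input, not model output
        "content": [{"text": {"text": query}}],
    }

def which_policy_triggered(response: dict) -> list[str]:
    """Return the assessment policy keys (e.g. 'topicPolicy') that fired, if any."""
    if response.get("action") != "GUARDRAIL_INTERVENED":
        return []
    return sorted({key for assessment in response.get("assessments", []) for key in assessment})

# Usage (placeholder guardrail ID; requires AWS credentials):
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.apply_guardrail(
#     **build_apply_guardrail_request("gr-abc123", "1", "How do I reset my password?")
# )
# print(which_policy_triggered(resp))  # e.g. ['topicPolicy'] -> step 3: narrow the topic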
Issue: FM answers are inconsistent across identical queries
1. Check temperature setting — temperature > 0.3 introduces significant variation
→ For factual Q&A: set temperature to 0.0–0.1 for deterministic output
2. Check if Knowledge Base sync has run recently
→ Different sync states across parallel replicas → answers from different doc versions
3. Check retrieval — is k-NN search returning different top chunks across calls?
→ HNSW approximate search has non-determinism → enable exact k-NN for debugging
4. Check Lambda version routing — are multiple Lambda versions running?
→ A/B test or canary may be routing to different prompt versions
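Step 1's fix can be captured as a reusable inference config for the Converse API. A minimal sketch — the model ID in the usage comment is a placeholder, and note that even at temperature 0.0 some providers do not guarantee bit-identical outputs across calls:

```python
def deterministic_inference_config(max_tokens: int = 512) -> dict:
    """inferenceConfig for the Bedrock Converse API tuned for reproducible factual Q&A."""
    return {
        "temperature": 0.0,  # near-greedy decoding per step 1 of the runbook
        "topP": 1.0,         # leave nucleus sampling effectively disabled
        "maxTokens": max_tokens,
    }

# Usage (placeholder model ID; requires AWS credentials):
# import boto3
# client = boto3.client("bedrock-runtime")
# resp = client.converse(
#     modelId="anthropic.claude-3-haiku-20240307-v1:0",
#     messages=[{"role": "user", "content": [{"text": "What is our refund window?"}]}],
#     inferenceConfig=deterministic_inference_config(),
# )
```

If outputs still vary with this config, temperature is eliminated as the cause and the investigation moves to steps 2–4 (stale syncs, approximate k-NN, version routing).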
Issue: Agents loop without completing tasks
1. Enable agent tracing → examine reasoning trace for circular reasoning pattern
2. Check tool return values — are tools returning errors the agent misinterprets?
3. Verify tool schemas — ambiguous tool descriptions cause agent confusion
4. Check maximum iteration limit — if not set, agents will loop until timeout
5. Examine stopping condition — agent may lack a clear signal for task completion
⚠️ Exam Trap: Agent looping is often caused by a tool returning an error format the agent doesn't recognize as an error — it treats the error message as partial information and keeps trying variations of the same call. Always return structured error responses from action group Lambda functions in the format the agent's underlying model expects: {"error": "description", "error_code": "NOT_FOUND"} rather than raising an unhandled exception.
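The structured-error pattern from the Exam Trap can be sketched as an action group Lambda handler. This assumes an OpenAPI-schema action group (function-detail action groups use a different response shape), and `lookup_ticket` with its `_TICKETS` datastore is a hypothetical stand-in for real data access:

```python
import json

_TICKETS = {"T-1": {"id": "T-1", "status": "open"}}  # stand-in datastore

def lookup_ticket(ticket_id):
    """Hypothetical data-access helper."""
    return _TICKETS.get(ticket_id)

def make_agent_response(event: dict, status_code: int, body: dict) -> dict:
    """Wrap a payload in the Bedrock Agents action group response envelope."""
    return {
        "messageVersion": "1.0",
        "response": {
            "actionGroup": event["actionGroup"],
            "apiPath": event["apiPath"],
            "httpMethod": event["httpMethod"],
            "httpStatusCode": status_code,
            "responseBody": {"application/json": {"body": json.dumps(body)}},
        },
    }

def lambda_handler(event, context):
    ticket_id = next(
        (p["value"] for p in event.get("parameters", []) if p["name"] == "ticket_id"),
        None,
    )
    ticket = lookup_ticket(ticket_id)
    if ticket is None:
        # Structured error the model can recognize -- never raise an unhandled exception
        return make_agent_response(
            event, 404, {"error": f"ticket {ticket_id} not found", "error_code": "NOT_FOUND"}
        )
    return make_agent_response(event, 200, ticket)
```

The key design choice is that the "not found" path returns a well-formed envelope with an explicit `error_code`, so the agent's model sees an unambiguous failure signal instead of an exception trace it might misread as partial data and retry against.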
Reflection Question: A Bedrock Agent tasked with "summarize all open support tickets from the past 7 days" completes successfully for most runs but occasionally loops 15+ times and times out. The tracing shows it successfully retrieves ticket data on the first tool call. What in the agent configuration or tool implementation would you investigate first, and what is the likely root cause?