7.2. Troubleshooting GenAI Applications
💡 First Principle: GenAI troubleshooting is fundamentally different from traditional application debugging because the failure mode is usually not an error. It is a quality degradation that manifests as wrong, unhelpful, or harmful outputs that look correct to the infrastructure layer. Effective troubleshooting starts with classifying which component produced the failure, then examining that component's specific failure modes.
The five-component fault isolation model for RAG applications: (1) Input preprocessing, (2) Retrieval pipeline, (3) Context assembly, (4) FM inference, (5) Output processing. Each has distinct failure signatures. Jumping to the FM as the default culprit — as many practitioners do — leads to expensive model switching when the actual problem is in the retrieval pipeline.
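The upstream-first isolation idea can be sketched as a walk over per-stage health flags recorded in a request trace. This is a minimal illustration, not a specific framework's API; the flag names (`query_normalized_ok`, `relevant_chunks_found`, etc.) are hypothetical labels for checks you would implement per component.

```python
from enum import Enum
from typing import Optional

class Component(Enum):
    INPUT_PREPROCESSING = 1
    RETRIEVAL = 2
    CONTEXT_ASSEMBLY = 3
    FM_INFERENCE = 4
    OUTPUT_PROCESSING = 5

# Checks ordered upstream to downstream: a failure earlier in the
# pipeline usually explains the failures observed after it.
CHECKS = [
    ("query_normalized_ok", Component.INPUT_PREPROCESSING),
    ("relevant_chunks_found", Component.RETRIEVAL),
    ("context_within_window", Component.CONTEXT_ASSEMBLY),
    ("answer_grounded_in_context", Component.FM_INFERENCE),
    ("output_schema_valid", Component.OUTPUT_PROCESSING),
]

def isolate_fault(trace: dict) -> Optional[Component]:
    """Return the first (most upstream) component whose check failed,
    or None if every recorded check passed. Missing flags are treated
    as passing, since absence of evidence is not a failure signal."""
    for flag, component in CHECKS:
        if not trace.get(flag, True):
            return component
    return None
```

Note how a trace in which both retrieval and grounding checks fail points at retrieval, not the FM: the ungrounded answer is a downstream symptom of the upstream retrieval miss.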
⚠️ Common Misconception: If the FM output is wrong, the fix is to use a better model. In the majority of production RAG failures, the FM is functioning correctly — it is accurately summarizing or responding to the context it received. The problem is that the retrieved context was wrong, incomplete, or irrelevant. Fixing retrieval quality is almost always faster and cheaper than upgrading models.
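One cheap way to test this misconception on a live request is a lexical grounding check: what fraction of the answer's content words actually appear in the retrieved context? The sketch below is a crude proxy under that assumption; production systems typically use an NLI model or LLM judge instead, and the stopword list here is illustrative, not exhaustive.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "was", "of", "to",
             "and", "in", "it", "that", "this", "for", "on"}

def grounding_score(answer: str, retrieved_chunks: list[str]) -> float:
    """Fraction of content words in the answer that occur anywhere in
    the retrieved context. Low scores suggest the FM is inventing
    material; high scores with a wrong answer suggest the retrieval
    pipeline supplied wrong or irrelevant context."""
    answer_words = {w for w in re.findall(r"[a-z']+", answer.lower())
                    if w not in STOPWORDS}
    if not answer_words:
        return 1.0  # nothing to ground
    context = " ".join(retrieved_chunks).lower()
    grounded = {w for w in answer_words if w in context}
    return len(grounded) / len(answer_words)
```

A high grounding score on a factually wrong answer is the signature described above: the FM faithfully summarized bad context, so the fix belongs in retrieval, not in model selection.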
| Symptom | Most Likely Component | Diagnostic Step | Common Fix |
|---|---|---|---|
| Hallucinated facts not in any document | FM inference | Check if relevant context was retrieved | Add grounding check; improve retrieval |
| Correct facts but wrong answer structure | FM inference / output processing | Check output format instructions | Strengthen system prompt output spec |
| No relevant chunks retrieved | Retrieval pipeline | Log retrieved chunks per query | Adjust chunk size, switch to hybrid search |
| Agent loops endlessly | Tool integration | Inspect tool return format | Fix Lambda return schema to match agent expectation |
| Correct answer for simple queries, wrong for complex | Model capability | Check query complexity distribution | Route complex queries to more capable model |
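The "log retrieved chunks per query" diagnostic in the retrieval row above can be sketched as one structured log record per request, so retrieval quality is inspectable offline. The chunk dict shape (`id`, `score`, `text`) and the `score_floor` threshold are assumptions for illustration, not a particular vector store's API.

```python
import json
import logging

logger = logging.getLogger("rag.retrieval")

def log_retrieval(query: str, chunks: list[dict],
                  score_floor: float = 0.5) -> dict:
    """Build and emit a structured record of what retrieval returned
    for one query. Counting chunks below a similarity floor surfaces
    queries where nothing relevant was actually found, even though
    the pipeline 'succeeded' at the infrastructure level."""
    record = {
        "query": query,
        "chunk_ids": [c["id"] for c in chunks],
        "scores": [round(c["score"], 3) for c in chunks],
        "below_floor": sum(1 for c in chunks if c["score"] < score_floor),
    }
    logger.info(json.dumps(record))
    return record
```

Aggregating these records by `below_floor` quickly separates the "no relevant chunks retrieved" symptom from genuine FM-inference failures before any model change is considered.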