7. Testing, Validation, and Troubleshooting (11%)
Domain 5 is the diagnostic domain — it tests whether you can tell the difference between a good FM system and a bad one, and whether you can fix a bad one when it breaks. Unlike the other domains, which are primarily about building systems, Domain 5 is about evaluating them.
The exam presents two types of Domain 5 scenarios: evaluation design (how would you measure whether this system is working?) and troubleshooting (this system is producing wrong/unsafe/slow outputs — what is the root cause and how do you fix it?). Both require systematic frameworks, not ad-hoc debugging.
⚠️ Common Misconception: ROUGE and BLEU scores are the standard evaluation metrics for LLM-based applications. In fact, ROUGE and BLEU measure n-gram overlap against a reference output: BLEU was designed for machine translation and ROUGE for summarization, both tasks where reference outputs exist. For open-ended GenAI applications, these metrics systematically undervalue correct answers that paraphrase the reference and overvalue fluent but wrong responses that reuse the reference's wording. Modern GenAI evaluation relies instead on semantic similarity metrics (such as BERTScore), RAG evaluation frameworks (such as RAGAS), and LLM-as-judge approaches.
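The failure mode behind this misconception can be illustrated with a small sketch. The function below is a simplified, ROUGE-1-style unigram F1 written for illustration only; it is not the official ROUGE implementation (no stemming, whitespace tokenization only), and the function name `unigram_f1` and the example sentences are assumptions made up for this demonstration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 (illustrative, not the official metric).

    Tokenizes on whitespace and scores clipped unigram overlap.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (clipped overlap).
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data: one reference answer and two candidate answers.
reference = "the capital of france is paris"
candidates = {
    "paraphrase_correct": "paris serves as the french capital",
    "fluent_wrong": "the capital of france is lyon",
}

scores = {name: unigram_f1(text, reference) for name, text in candidates.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
# paraphrase_correct: 0.50
# fluent_wrong: 0.83
```

The factually wrong answer scores higher than the correct paraphrase because it shares five of six tokens with the reference, while the paraphrase shares only three. A semantic metric or an LLM-as-judge rubric would be expected to reverse this ranking, which is why exam scenarios involving open-ended outputs generally point away from pure n-gram metrics.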