7.1. Evaluation Frameworks and Quality Assessment
💡 First Principle: Every evaluation framework must answer three separate questions: Is the output factually correct? Is it faithful to the provided context? Is it useful to the user? These three questions require different evaluation techniques — factual correctness requires ground truth comparison, faithfulness requires context grounding checks, and usefulness often requires human evaluation or LLM-as-judge.
No single metric captures all three. Applications that optimize for one metric while ignoring the others systematically fail in the dimension they don't measure — a model fine-tuned to maximize ROUGE scores on a summary task will learn to output the same sentences that appear in reference summaries, rather than to produce maximally useful paraphrases.
⚠️ Common Misconception: A high accuracy score on a golden question-answer dataset means the system will perform well in production. Production inputs are drawn from a completely different distribution than golden datasets — users ask questions the golden dataset doesn't cover, phrase things unexpectedly, and pursue goals that test cases didn't anticipate. Evaluation must include adversarial testing, edge cases, and real production query sampling.
| What You're Testing | Metric | How It's Measured | Failure Mode It Catches |
|---|---|---|---|
| Factual correctness | Accuracy vs. ground truth | Exact match, ROUGE (text), human eval | FM inventing facts |
| Context faithfulness | Faithfulness (RAGAS) | LLM-as-judge against retrieved context | FM ignoring or contradicting retrieved docs |
| Retrieval quality | Context Precision + Recall (RAGAS) | Retrieved chunks vs. relevant chunks | Wrong chunks retrieved, relevant chunks missed |
| Answer usefulness | Answer Relevancy (RAGAS) | Semantic similarity to question | Technically correct but unhelpful answers |
| Safety | Policy compliance rate | Guardrails block rate on adversarial inputs | Harmful outputs reaching users |