3.4.1. Human Evaluation and Benchmark Datasets
First Principle: When you need to know if a model is genuinely good — not just statistically similar to a reference — you need human judgment and standardised comparison.
Imagine building a customer service chatbot. Automated metrics can tell you whether responses contain similar words to a reference answer. They cannot tell you whether the response was actually helpful, appropriately empathetic, or free of subtle factual errors. Only a human rater can assess those dimensions.
Human Evaluation involves trained raters assessing model outputs against criteria like relevance, fluency, factual accuracy, tone, and task completion. It is the gold standard — but it is slow (hours to days per evaluation round), expensive, and introduces inter-rater variability (two evaluators may score the same output differently).
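Inter-rater variability can be quantified so you know how much to trust your human scores. A common statistic is Cohen's kappa, which measures agreement between two raters after correcting for agreement expected by chance. A minimal sketch (the rater labels below are illustrative, not from a real evaluation):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Returns 1.0 for perfect agreement, ~0.0 for chance-level agreement.
    Assumes the two label sequences are aligned item-by-item and that
    agreement is not already perfect by chance (expected < 1).
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability both pick the same label independently
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[lbl] * counts_b[lbl] for lbl in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two raters label 10 chatbot responses as helpful / unhelpful
a = ["helpful"] * 6 + ["unhelpful"] * 4
b = ["helpful"] * 5 + ["unhelpful"] * 5
print(round(cohens_kappa(a, b), 2))  # 0.8 — substantial agreement
```

A kappa well below roughly 0.6 usually signals that the rating criteria are ambiguous and need tightening before the scores are used to compare models.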
Benchmark Datasets are standardised, publicly released test sets on which any model can be evaluated, enabling direct comparison:
- MMLU (Massive Multitask Language Understanding): Tests knowledge across 57 academic subjects.
- HumanEval: Tests code-generation ability on hand-written Python programming problems.
- GLUE / SuperGLUE: Test language understanding and reasoning.
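Mechanically, most knowledge benchmarks like MMLU reduce to multiple-choice accuracy: the model picks an option per question, and the score is the fraction of correct picks. A minimal sketch, where `model_answer` stands in for whatever inference call your model exposes (a hypothetical placeholder, not a real benchmark harness):

```python
def benchmark_accuracy(model_answer, dataset):
    """Score a model on a multiple-choice benchmark.

    model_answer: callable (question, choices) -> index of the chosen option
    dataset: list of dicts with "question", "choices", and "answer"
             (the index of the correct choice)
    Returns the fraction of questions answered correctly.
    """
    correct = sum(
        model_answer(item["question"], item["choices"]) == item["answer"]
        for item in dataset
    )
    return correct / len(dataset)

# Toy MMLU-style items and a trivial "model" that always picks option 0
sample = [
    {"question": "2 + 2 = ?", "choices": ["4", "5", "6", "7"], "answer": 0},
    {"question": "Capital of France?",
     "choices": ["Berlin", "Paris", "Rome", "Madrid"], "answer": 1},
]
always_first = lambda question, choices: 0
print(benchmark_accuracy(always_first, sample))  # 0.5
```

Because every model is scored against the same fixed items, the resulting numbers are directly comparable across models, which is precisely what makes benchmarks useful for side-by-side selection.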
Strength: Benchmarks enable reproducible, objective side-by-side model comparisons — invaluable when selecting a foundation model for a project.
Critical limitation: Strong benchmark performance does not guarantee strong performance on your specific use case. A model that tops the MMLU leaderboard may still fail at answering questions about your proprietary internal documentation. Always validate on a representative sample of your own data before committing.
Scenario: A team must choose between three foundation models for a legal document Q&A system. They run benchmark comparisons on MMLU for baseline intelligence, then evaluate all three on 100 sample questions from their actual legal corpus with human lawyer reviewers scoring the answers. Only the second step reveals which model actually works for their use case.
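The scenario's two-step selection can be sketched as a simple screen-then-rank procedure: filter out models below a benchmark floor, then rank the survivors by mean human rating on the team's own data. All model names, scores, and the threshold below are invented for illustration:

```python
def pick_model(benchmark_scores, human_scores, min_benchmark=0.6):
    """Two-step selection: screen by benchmark score, rank by human rating.

    benchmark_scores: {model_name: benchmark accuracy in [0, 1]}
    human_scores: {model_name: list of human ratings (e.g. 1-5)}
    Returns the shortlisted model with the highest mean human rating.
    """
    # Step 1: benchmark screen — drop models lacking baseline capability
    shortlist = [m for m, s in benchmark_scores.items() if s >= min_benchmark]
    # Step 2: human evaluation on use-case data decides among the rest
    mean = lambda xs: sum(xs) / len(xs)
    return max(shortlist, key=lambda m: mean(human_scores[m]))

benchmarks = {"model-a": 0.82, "model-b": 0.78, "model-c": 0.55}
ratings = {  # lawyer scores on sample corpus questions (abbreviated)
    "model-a": [3, 4, 3, 2],
    "model-b": [5, 4, 5, 4],
    "model-c": [4, 4, 4, 4],
}
print(pick_model(benchmarks, ratings))  # model-b
```

Note the outcome: model-a tops the benchmark, but model-b wins on the human-rated legal questions, exactly the gap between benchmark performance and use-case performance the section warns about.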
Reflection Question: When is benchmark evaluation sufficient, and when is human evaluation essential? Consider: what does MMLU measure, and what does it not measure about your specific deployment context?
⚠️ Exam Tip: Benchmark datasets enable comparison between models. Human evaluation assesses actual quality for a specific task. Benchmarks do not guarantee production performance. The exam will test whether you know when each approach is appropriate.