3.4. Describing Methods to Evaluate Foundation Model Performance
Think of evaluating a generative AI model like judging a chef rather than a calculator. A calculator is either right or wrong — but a chef's output requires multiple dimensions of assessment: expert tasters, standardised recipe tests, and objective measurements. No single number captures quality.
Unlike traditional ML, where a prediction can be scored against a single ground truth, generative AI outputs are open-ended and context-dependent. A summary can be factually accurate but poorly written; a translation can be grammatically correct but tonally wrong. Because of this, robust evaluation combines multiple approaches: automated metrics, standardised benchmarks, and human (or model-based) judgement, each covering dimensions the others miss.
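To make the multi-dimensional idea concrete, here is a minimal sketch of scoring a single summarisation example along several axes at once. The overlap metric, the placeholder judge verdicts, and all names (`EvalResult`, `word_overlap_f1`) are illustrative assumptions, not a standard library or an established metric; in practice the consistency and fluency scores would come from human annotators or an LLM-as-judge prompt.

```python
from dataclasses import dataclass

# Hypothetical data for one summarisation example.
REFERENCE = "The board approved the merger on 12 May."
CANDIDATE = "The merger was approved by the board in May."

@dataclass
class EvalResult:
    lexical_overlap: float      # cheap automated signal (word-overlap F1)
    factually_consistent: bool  # verdict from a human or LLM judge
    fluency: int                # 1-5 rating from a human or LLM judge

def word_overlap_f1(candidate: str, reference: str) -> float:
    """Crude automated metric: F1 over lowercased word sets."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def evaluate(candidate: str, reference: str) -> EvalResult:
    # Judge scores are hard-coded placeholders to keep the sketch self-contained.
    return EvalResult(
        lexical_overlap=round(word_overlap_f1(candidate, reference), 2),
        factually_consistent=True,   # placeholder judge verdict
        fluency=4,                   # placeholder 1-5 rating
    )

print(evaluate(CANDIDATE, REFERENCE))
```

The point of keeping the dimensions separate, rather than collapsing them into one score, is that a high overlap with a low fluency rating and a high fluency rating with a failed consistency check point to very different problems.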
What breaks without proper evaluation? Models may hallucinate confidently in production, perform well on public benchmarks yet fail on real company data, or degrade silently over time, with no visibility into the problem until users complain.
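One simple guard against silent degradation is to re-run a fixed evaluation set on a schedule and compare against the scores recorded at launch. The sketch below assumes such a fixed suite exists; `run_eval_suite`, `BASELINE_SCORES`, and the numbers are all hypothetical placeholders, not a real monitoring API.

```python
# Minimal sketch of a scheduled regression check over a fixed evaluation set.

BASELINE_SCORES = {"summarisation": 0.78, "qa": 0.84}  # aggregate scores at launch
ALERT_THRESHOLD = 0.05                                  # tolerated drop before alerting

def run_eval_suite(task: str) -> float:
    """Placeholder: would run the current model over the task's fixed
    evaluation set and return an aggregate quality score in [0, 1]."""
    return {"summarisation": 0.71, "qa": 0.85}[task]    # simulated current results

def check_for_regressions() -> list[str]:
    alerts = []
    for task, baseline in BASELINE_SCORES.items():
        current = run_eval_suite(task)
        if baseline - current > ALERT_THRESHOLD:
            alerts.append(f"{task}: score fell from {baseline:.2f} to {current:.2f}")
    return alerts

if __name__ == "__main__":
    for alert in check_for_regressions():
        print("REGRESSION:", alert)
```

Run periodically (for example, after each model or prompt change), a check like this surfaces quality drops before they reach users.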