3.4.2. Automated Metrics: ROUGE and BLEU
First Principle: When human evaluation is too slow or expensive for every iteration, automated text-overlap metrics provide a cheap, scalable proxy — with important limitations you must understand.
Think of ROUGE and BLEU like a spell-checker for content quality: they can tell you whether the right words are present, but not whether the meaning is correct, the tone is appropriate, or the answer is genuinely helpful.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) — used for summarisation:
- ROUGE-1: Measures overlap of individual words (unigrams) between the generated and reference summary.
- ROUGE-2: Measures overlap of two-word phrases (bigrams) — captures some phrase structure.
- ROUGE-L: Measures the longest common subsequence — captures sentence-level structure.
- Recall-oriented: It measures how much of the reference summary's content appears in the generated output. High recall = the summary covers the same ground as the reference.
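To make "recall-oriented" concrete, here is a minimal from-scratch sketch of ROUGE-1 recall (production evaluations typically use a library such as `rouge-score`, which also handles stemming and ROUGE-2/ROUGE-L; the example sentences are hypothetical):

```python
from collections import Counter

def rouge1_recall(reference: str, generated: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams that appear in the generated text."""
    ref_counts = Counter(reference.lower().split())
    gen_counts = Counter(generated.lower().split())
    # Clipped overlap: each reference word can only be matched as often as it occurs.
    overlap = sum(min(count, gen_counts[word]) for word, count in ref_counts.items())
    return overlap / sum(ref_counts.values())

reference = "the cat sat on the mat"
generated = "the cat lay on the mat"
print(round(rouge1_recall(reference, generated), 2))  # 5 of 6 reference words matched -> 0.83
```

Note the denominator is the reference length: that is what makes the metric recall-oriented, rewarding outputs that cover the reference's content rather than outputs that are merely precise.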
BLEU (Bilingual Evaluation Understudy) — used for machine translation:
- Measures modified (clipped) n-gram precision: how many n-grams in the generated translation appear in the human reference, combined as a geometric mean over n-gram sizes (typically 1-grams up to 4-grams).
- Produces a score from 0 to 1 (often reported scaled to 0–100); higher is better.
- Penalises outputs that are shorter than the reference (brevity penalty) — necessary because precision alone would reward very short, "safe" outputs.
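The pieces above — clipped n-gram precision, geometric mean, brevity penalty — can be sketched in a simplified BLEU implementation (real evaluations use standard tooling such as sacreBLEU, which adds smoothing and careful tokenisation; this version omits both):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(reference: str, generated: str, max_n: int = 2) -> float:
    """Simplified BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    ref, gen = reference.lower().split(), generated.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        gen_counts = Counter(ngrams(gen, n))
        ref_counts = Counter(ngrams(ref, n))
        if not gen_counts:
            return 0.0
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in gen_counts.items())
        precisions.append(clipped / sum(gen_counts.values()))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(gen) >= len(ref) else math.exp(1 - len(ref) / len(gen))
    return bp * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # exact match -> 1.0
```

Try a truncated candidate like `bleu("the cat sat on the mat", "the cat")`: its n-gram precisions are perfect, but the brevity penalty pulls the score well below 1 — exactly the failure mode the penalty exists to catch.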
Why both metrics have limitations:
- A model can score highly on ROUGE/BLEU while still being factually wrong or incoherent.
- Valid paraphrases (same meaning, different words) are penalised even if they are better than the reference.
- Neither metric understands meaning — they count matching strings.
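The paraphrase limitation is easy to demonstrate with a toy unigram-overlap score and two hypothetical outputs — one that copies the reference's words, one that restates its meaning:

```python
def unigram_overlap(reference: str, hypothesis: str) -> float:
    """Fraction of distinct reference words that also appear in the hypothesis."""
    ref_words, hyp_words = set(reference.lower().split()), set(hypothesis.lower().split())
    return len(ref_words & hyp_words) / len(ref_words)

reference  = "the film was excellent"
literal    = "the film was excellent but long"  # reuses the reference's words
paraphrase = "a superb movie"                   # same meaning, different words

print(unigram_overlap(reference, literal))      # 1.0 — every reference word matched
print(unigram_overlap(reference, paraphrase))   # 0.0 — valid paraphrase, zero credit
```

The paraphrase is a perfectly good answer, yet it scores zero: the metric counts strings, not meaning — which is why overlap scores must be paired with human or model-based review.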
Scenario: A news summarisation model scores ROUGE-1 = 0.72 on a test set. The team celebrates, but when human reviewers read 50 summaries, they find 30% contain subtle factual errors not caught by ROUGE. The metric measured word overlap; it could not detect fabricated facts that used real words from the source article.
Reflection Question: If ROUGE cannot detect hallucinations, what role should it play in a production evaluation pipeline? How would you design an evaluation strategy that catches both coverage gaps (ROUGE) and factual errors (human review)?
⚠️ Exam Tip: ROUGE → summarisation (recall-oriented, measures coverage of reference content). BLEU → translation (precision-oriented, measures n-gram match). Both measure word overlap only — neither measures factual accuracy, coherence, or true quality. The exam tests these distinctions directly.