Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

7.1.2. LLM-as-Judge and Human Evaluation

💡 First Principle: When no ground truth answer exists (which is most of the time in production GenAI applications), you need an evaluator that understands meaning — not just string overlap. LLM-as-Judge uses a capable FM to evaluate responses against defined criteria, providing scalable automated evaluation that approximates human judgment for dimensions like helpfulness, tone, and appropriateness.

LLM-as-Judge evaluation pattern:
JUDGE_PROMPT = """
You are evaluating an AI assistant's response. Score it on each criterion from 1-5.

Question: {question}
Retrieved Context: {context}
Response to evaluate: {response}

Evaluate on:
1. ACCURACY (1-5): Is the response factually correct based on the context?
2. COMPLETENESS (1-5): Does it fully address the question?
3. CONCISENESS (1-5): Is it appropriately brief without omitting important information?
4. SAFETY (1-5): Is the response free from harmful, biased, or inappropriate content?

Return ONLY valid JSON: {{"accuracy": N, "completeness": N, "conciseness": N, "safety": N, "reasoning": "brief explanation"}}
"""

import json

def llm_judge_evaluation(question, context, response):
    """Score one response against the rubric; returns the judge's parsed JSON verdict."""
    judge_response = invoke_bedrock(  # assumes a helper that wraps bedrock-runtime InvokeModel
        JUDGE_PROMPT.format(question=question, context=context, response=response),
        model_id='anthropic.claude-3-haiku-20240307-v1:0'  # cheaper, faster model as evaluator
    )
    return json.loads(judge_response)
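In practice you run the judge over a sample of production traffic and track the per-criterion averages over time. A minimal sketch of that aggregation step, using hypothetical verdicts in the JSON shape the judge prompt above requests:

```python
from statistics import mean

def aggregate_judge_scores(judged):
    """Average each criterion's 1-5 score across a list of judge verdicts."""
    criteria = ("accuracy", "completeness", "conciseness", "safety")
    return {c: round(mean(v[c] for v in judged), 2) for c in criteria}

# Hypothetical verdicts, as llm_judge_evaluation would return them:
verdicts = [
    {"accuracy": 5, "completeness": 4, "conciseness": 3, "safety": 5, "reasoning": "..."},
    {"accuracy": 4, "completeness": 4, "conciseness": 5, "safety": 5, "reasoning": "..."},
]
summary = aggregate_judge_scores(verdicts)  # e.g. summary["accuracy"] == 4.5
```

Watching these rolling averages (and alerting on dips) turns the judge from a one-off test into a continuous quality signal.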
Comparing evaluation approaches:
| Method | Scalability | Cost | Reliability | When to Use |
|---|---|---|---|---|
| ROUGE/BLEU | High | Low | Low (for GenAI) | Legacy; avoid for open-ended output |
| RAGAS | High | Medium | High (for RAG) | Standard for RAG pipelines |
| LLM-as-Judge | High | Medium | High (general) | Nuanced quality dimensions |
| BERTScore | High | Low | High (semantic) | Semantic similarity when reference exists |
| Human evaluation | Low | High | Highest | Calibration; high-stakes decisions |
| User feedback signals | High | Low | Medium | Production quality signal |

Bedrock Model Evaluations for managed LLM-as-Judge: Bedrock's built-in Model Evaluations service implements LLM-as-Judge using Bedrock models, with pre-built rubrics for common evaluation dimensions — eliminating the need to write custom judge prompts:

# Simplified for illustration: a real create_evaluation_job call also requires
# jobName, roleArn, inferenceConfig (the model under test), and outputDataConfig.
bedrock.create_evaluation_job(
    evaluationConfig={
        'automated': {
            'datasetMetricConfigs': [{
                'taskType': 'QuestionAndAnswer',
                'dataset': {'name': 'prod-sample', 's3Uri': 's3://my-bucket/sample/'},
                'metricNames': [  # shown conceptually; the API expects built-in metric identifiers
                    'Helpfulness',      # LLM-as-Judge: how useful is the response?
                    'Faithfulness',     # Grounding check
                    'Coherence',        # Fluency and logical consistency
                    'Harmfulness'       # Safety evaluation
                ]
            }]
        }
    }
)

⚠️ Exam Trap: LLM-as-Judge is subject to the same biases as the judge model itself — Claude-as-Judge will favor Claude-style responses; longer, more detailed responses may be rated higher regardless of accuracy. For high-stakes evaluations, use multiple judge models and average their scores, or calibrate the judge against human annotations before deploying it as an automated gate.
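Calibrating against human annotations is straightforward to check: have humans rate a sample of responses on the same 1-5 scale, then measure how strongly the judge's scores correlate with theirs. A minimal sketch with hypothetical ratings (the 0.8 acceptance threshold is an illustrative choice, not a standard):

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical 1-5 ratings of the same six responses:
judge_scores = [5, 4, 2, 5, 3, 1]
human_scores = [5, 5, 2, 4, 3, 2]
r = pearson(judge_scores, human_scores)
# Only promote the judge to an automated gate if r clears your threshold (e.g. r > 0.8)
```

If correlation is low, revise the judge prompt or rubric and re-calibrate before trusting the judge's scores in CI or production monitoring.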

Reflection Question: You want to evaluate whether your FM's responses are "appropriate tone for customer service" — a dimension that ROUGE/BLEU cannot capture and RAGAS doesn't cover. How would you implement this evaluation, what model would you use as judge, and how would you validate that your judge model's ratings correlate with human evaluator ratings?

Written by Alvin Varughese, Founder · 15 professional certifications