Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

7.1.1. RAG-Specific Evaluation: RAGAS Framework

💡 First Principle: RAG evaluation must independently assess the retrieval pipeline and the generation pipeline — a failure at either stage produces wrong answers, but the fix is different. RAGAS (Retrieval Augmented Generation Assessment) provides a framework with separate metrics for each stage.

RAGAS metric definitions:

| Metric | Measures | How It's Computed | Target |
|---|---|---|---|
| Faithfulness | Does the response contain only claims supported by retrieved context? | LLM checks each claim against context | >0.8 |
| Answer Relevancy | Does the response actually answer the question asked? | Embed response → similarity to original question | >0.75 |
| Context Precision | Are the retrieved chunks actually relevant to the question? | LLM evaluates each chunk's relevance | >0.7 |
| Context Recall | Does the retrieved context contain all info needed to answer? | Checks answer against ground truth | >0.7 |
| Answer Correctness | Is the factual content of the answer correct? | Semantic similarity to ground truth answer | >0.75 |
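The faithfulness row above relies on an LLM judge to verify each claim against the retrieved context. As a rough, LLM-free illustration of the same ratio (supported claims ÷ total claims), here is a toy sketch based on word overlap — the sentence splitting and `support_threshold` are illustrative assumptions, not RAGAS internals:

```python
import re

def toy_faithfulness(answer: str, contexts: list[str],
                     support_threshold: float = 0.5) -> float:
    """Fraction of answer sentences whose words mostly appear in the context.

    A crude stand-in for RAGAS faithfulness, which uses an LLM judge
    to verify each claim rather than word overlap.
    """
    context_words = set(re.findall(r"\w+", " ".join(contexts).lower()))
    sentences = [s for s in re.split(r"[.!?]+", answer) if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(re.findall(r"\w+", sentence.lower()))
        if words and len(words & context_words) / len(words) >= support_threshold:
            supported += 1
    return supported / len(sentences)
```

A claim drawn from the context scores 1.0; a claim the context never mentions scores 0.0 — the same directional signal the real metric provides, without the judge LLM.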
Diagnosing RAG failures with RAGAS metrics:
- Low Faithfulness (with healthy context scores) → the generation stage is hallucinating; tighten the prompt or lower temperature.
- Low Answer Relevancy → the model drifts off-topic; revise the prompt template.
- Low Context Precision → retrieval returns irrelevant chunks; improve chunking or embeddings, or add a reranker.
- Low Context Recall → the needed information is never retrieved; raise top-k or check source coverage.
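This triage can be sketched as a small helper. The thresholds mirror the targets from the metric table above; the diagnosis strings are illustrative, not part of RAGAS:

```python
# Target thresholds from the RAGAS metric table.
TARGETS = {
    "faithfulness": 0.8,
    "answer_relevancy": 0.75,
    "context_precision": 0.7,
    "context_recall": 0.7,
}

# Illustrative mapping from a failing metric to the pipeline stage to fix.
DIAGNOSES = {
    "faithfulness": "generation: unsupported claims; tighten the prompt",
    "answer_relevancy": "generation: off-topic answers; revise the prompt template",
    "context_precision": "retrieval: irrelevant chunks; improve chunking or add a reranker",
    "context_recall": "retrieval: needed info missing; raise top-k or check source coverage",
}

def diagnose(scores: dict[str, float]) -> list[str]:
    """Return a diagnosis for each metric that falls below its target."""
    return [DIAGNOSES[m] for m, s in scores.items() if s < TARGETS.get(m, 0.0)]
```

Feeding in a score dict returns only the diagnoses for metrics below target, so a healthy pipeline returns an empty list.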
Implementing RAGAS evaluation in a Lambda function:

```python
import boto3

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

cloudwatch = boto3.client('cloudwatch')

def run_ragas_evaluation(test_cases):
    """
    test_cases: list of dicts with:
      - question: str
      - answer: str (FM output)
      - contexts: list[str] (retrieved chunks)
      - ground_truth: str (expected answer)
    """
    dataset = Dataset.from_list(test_cases)

    scores = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
    )

    # Publish each RAGAS score as a CloudWatch custom metric
    for metric_name, score in scores.items():
        cloudwatch.put_metric_data(
            Namespace='GenAI/RAGAS',
            MetricData=[{'MetricName': metric_name.capitalize(),
                         'Value': float(score), 'Unit': 'None'}]
        )

    return scores
```
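A record list matching the docstring's schema might look like this — the question, answer, and context values are purely illustrative:

```python
# Illustrative evaluation records matching the schema in the docstring above.
test_cases = [
    {
        "question": "What is the refund window?",
        "answer": "Refunds are available within 30 days of purchase.",
        "contexts": ["Our refund policy allows returns within 30 days of purchase."],
        "ground_truth": "Customers may request a refund within 30 days.",
    },
]

# Dataset.from_list expects every record to carry the same keys.
required = {"question", "answer", "contexts", "ground_truth"}
assert all(set(case) == required for case in test_cases)
```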

⚠️ Exam Trap: RAGAS metrics use an LLM internally to score faithfulness and relevancy — this means RAGAS evaluation itself costs Bedrock tokens. For large evaluation sets (10,000+ test cases), the evaluation cost can be significant. Use a cheaper model (e.g., Claude Haiku) as the RAGAS evaluator LLM while reserving the more capable model for the application being evaluated.
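The cost concern can be made concrete with back-of-the-envelope arithmetic. The token counts, calls per case, and per-token price below are illustrative assumptions, not published Bedrock pricing:

```python
def ragas_eval_cost(num_cases: int,
                    tokens_per_call: int = 2_000,
                    calls_per_case: int = 4,          # one judge call per metric
                    price_per_1k_tokens: float = 0.001) -> float:
    """Rough evaluator-LLM cost of a RAGAS run, in dollars.

    All defaults are illustrative assumptions, not real Bedrock prices.
    """
    total_tokens = num_cases * calls_per_case * tokens_per_call
    return total_tokens / 1_000 * price_per_1k_tokens

# 10,000 test cases at these assumptions → 80M judge tokens → $80.00
```

Even at a cheap-model rate, a 10,000-case run burns tens of millions of evaluator tokens — which is exactly why the evaluator model choice matters.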

Reflection Question: A RAGAS evaluation of your customer service RAG bot shows: Faithfulness = 0.91, Answer Relevancy = 0.88, Context Precision = 0.42, Context Recall = 0.79. Based on these scores alone, which component of your RAG pipeline has the most serious problem, and what specific change would you make to fix it?

Written by Alvin Varughese
Founder · 15 professional certifications