7.1.1. RAG-Specific Evaluation: RAGAS Framework
💡 First Principle: RAG evaluation must independently assess the retrieval pipeline and the generation pipeline — a failure at either stage produces wrong answers, but the fix is different. RAGAS (Retrieval Augmented Generation Assessment) provides a framework with separate metrics for each stage.
RAGAS metric definitions:
| Metric | Measures | How It's Computed | Target |
|---|---|---|---|
| Faithfulness | Does the response contain only claims supported by retrieved context? | LLM checks each claim against context | >0.8 |
| Answer Relevancy | Does the response actually answer the question asked? | Embed response → similarity to original question | >0.75 |
| Context Precision | Are the retrieved chunks actually relevant to the question? | LLM evaluates each chunk's relevance | >0.7 |
| Context Recall | Does the retrieved context contain all info needed to answer? | LLM checks whether each ground-truth claim is attributable to the retrieved context | >0.7 |
| Answer Correctness | Is the factual content of the answer correct? | Semantic similarity to ground truth answer | >0.75 |
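The faithfulness computation can be illustrated with a simplified sketch: decompose the answer into claims, verify each claim against the retrieved chunks, and score the supported fraction. The real RAGAS implementation uses an LLM for both decomposition and verification; the substring matching below is a stand-in for illustration only.

```python
def simple_faithfulness(claims: list[str], contexts: list[str]) -> float:
    """Fraction of claims found in some retrieved chunk.

    Stand-in for RAGAS's LLM-based claim verification: RAGAS asks an LLM
    whether each claim is inferable from the context; here we approximate
    with substring matching purely to show the scoring structure.
    """
    if not claims:
        return 0.0
    joined = " ".join(contexts).lower()
    supported = sum(1 for claim in claims if claim.lower() in joined)
    return supported / len(claims)

claims = ["the return window is 30 days", "refunds go to the original card"]
contexts = ["Our policy: the return window is 30 days from delivery."]
print(simple_faithfulness(claims, contexts))  # 0.5 — one of two claims supported
```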
Diagnosing RAG failures with RAGAS metrics:
| Symptom | Likely Failing Component | Typical Fix |
|---|---|---|
| Low Faithfulness | Generation (hallucinating beyond context) | Tighten the prompt to answer only from context; lower temperature |
| Low Answer Relevancy | Generation (off-topic responses) | Prompt the model to answer the question directly |
| Low Context Precision | Retrieval (irrelevant chunks returned) | Add a reranker; tune top-k or the chunking strategy |
| Low Context Recall | Retrieval/ingestion (needed info not retrieved) | Increase top-k; adjust chunk size; verify documents are indexed |
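This triage can be automated: flag every metric that falls below its target and name the likely failing component. A minimal sketch (the `diagnose` helper, threshold dict, and diagnosis strings are illustrative, not part of the RAGAS library):

```python
# Target thresholds from the RAGAS metric table above.
TARGETS = {"faithfulness": 0.8, "answer_relevancy": 0.75,
           "context_precision": 0.7, "context_recall": 0.7}

# Illustrative mapping from a low metric to the component to investigate.
COMPONENT = {
    "faithfulness": "generation: FM adds claims not in context",
    "answer_relevancy": "generation: FM drifts off-question",
    "context_precision": "retrieval: irrelevant chunks returned",
    "context_recall": "retrieval: needed info not retrieved",
}

def diagnose(scores: dict) -> list[str]:
    """Return one diagnosis line per metric below its target."""
    return [f"{m}={s:.2f} (target >{TARGETS[m]}): {COMPONENT[m]}"
            for m, s in scores.items() if m in TARGETS and s < TARGETS[m]]

print(diagnose({"faithfulness": 0.91, "answer_relevancy": 0.88,
                "context_precision": 0.42, "context_recall": 0.79}))
```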
Implementing RAGAS evaluation in a Lambda function:
```python
import boto3

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

cloudwatch = boto3.client('cloudwatch')

def run_ragas_evaluation(test_cases):
    """
    test_cases: list of dicts with:
      - question: str
      - answer: str (FM output)
      - contexts: list[str] (retrieved chunks)
      - ground_truth: str (expected answer)
    """
    dataset = Dataset.from_list(test_cases)
    scores = evaluate(
        dataset,
        metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
    )
    # Publish each aggregate metric score to CloudWatch
    for metric_name, score in scores.items():
        cloudwatch.put_metric_data(
            Namespace='GenAI/RAGAS',
            MetricData=[{
                'MetricName': metric_name.capitalize(),
                'Value': float(score),
                'Unit': 'None',
            }],
        )
    return scores
```
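A test case list in the shape the function's docstring describes might look like this (the question, answer, and context strings are illustrative placeholders):

```python
# Each dict becomes one row of the HuggingFace Dataset passed to ragas.evaluate.
test_cases = [
    {
        "question": "What is the return window?",
        "answer": "Returns are accepted within 30 days of delivery.",
        "contexts": ["Our policy: items may be returned within 30 days of delivery."],
        "ground_truth": "Items can be returned within 30 days of delivery.",
    },
]
print(sorted(test_cases[0].keys()))
```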
⚠️ Exam Trap: RAGAS metrics use an LLM internally to score faithfulness and relevancy — this means RAGAS evaluation itself costs Bedrock tokens. For large evaluation sets (10,000+ test cases), the evaluation cost can be significant. Use a cheaper model (Claude Haiku) for the RAGAS evaluator LLM while using a more capable model for the application being evaluated.
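To see why evaluator cost matters at scale, a back-of-the-envelope estimate (the calls-per-case, token counts, and per-token price below are illustrative assumptions, not published Bedrock rates):

```python
# Rough cost model: each test case triggers several evaluator LLM calls
# (claim extraction, claim verification, relevancy checks, per-chunk precision).
test_cases = 10_000
calls_per_case = 6            # assumed: varies by which metrics are enabled
tokens_per_call = 2_000       # assumed: prompt + context + response
price_per_1k_tokens = 0.0008  # illustrative Haiku-class price in USD

total_tokens = test_cases * calls_per_case * tokens_per_call
cost = total_tokens / 1_000 * price_per_1k_tokens
print(f"{total_tokens:,} evaluator tokens ≈ ${cost:,.2f}")  # 120,000,000 tokens ≈ $96.00
```

Swapping in a frontier-model price an order of magnitude higher turns the same run into four figures, which is why the evaluator model choice matters.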
Reflection Question: A RAGAS evaluation of your customer service RAG bot shows: Faithfulness = 0.91, Answer Relevancy = 0.88, Context Precision = 0.42, Context Recall = 0.79. Based on these scores alone, which component of your RAG pipeline has the most serious problem, and what specific change would you make to fix it?