6.2.1. Retrieval Performance
💡 First Principle: Vector search latency is dominated by index scan time, which scales with corpus size and the k-NN algorithm parameters. The optimization hierarchy for retrieval is: algorithm selection → index parameters → hardware scaling → query optimization → caching — in that order, because earlier optimizations have higher leverage.
Retrieval latency optimization levers:
| Lever | Latency Impact | Trade-off | Implementation |
|---|---|---|---|
| HNSW ef_search parameter | 30–60% reduction | Lower = faster but worse recall | Tune per use case; start with 512, reduce until recall degrades |
| Reduce retrieval k | Linear with k | Fewer chunks = less context | Test quality impact at k=3,5,10 |
| Metadata pre-filter | 40–80% reduction | Requires well-designed metadata schema | Add department/date filters before vector search |
| Semantic cache | 95%+ reduction for cache hits | Cache staleness risk | Cache FAQ-type queries |
| OpenSearch replicas | 30–50% under read load | Cost increase | Add replicas when CPU > 70% |
| Multi-AZ OpenSearch | Availability, not latency | Cost | For HA, not performance |
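The semantic-cache lever above can be sketched as follows. This is a minimal, exact-match version with a TTL staleness guard, assuming cache lookup happens before the retrieval step; a production semantic cache would match on embedding similarity rather than normalized strings, and `SemanticCache` is an illustrative name, not a library class.

```python
import hashlib
import time


class SemanticCache:
    """Minimal query cache sketch: normalized exact-match keys plus a TTL.

    A real semantic cache matches on embedding similarity; this version
    only catches trivial variants of FAQ-type queries.
    """

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self._store = {}

    def _key(self, query):
        # Normalize so "What is PTO?" and "what is pto?" share a key.
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query):
        entry = self._store.get(self._key(query))
        if entry is None:
            return None
        answer, stored_at = entry
        if time.time() - stored_at > self.ttl:
            # Staleness guard: expire old answers rather than serve them.
            del self._store[self._key(query)]
            return None
        return answer

    def put(self, query, answer):
        self._store[self._key(query)] = (answer, time.time())
```

On a cache hit the full retrieval-plus-generation path is skipped entirely, which is where the 95%+ latency reduction comes from; the trade-off is that cached answers can go stale, hence the TTL.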
Measuring retrieval vs. generation latency with X-Ray:
```python
from aws_xray_sdk.core import xray_recorder


@xray_recorder.capture('rag_pipeline')
def handle_query(user_query):
    with xray_recorder.in_subsegment('retrieval') as subseg:
        chunks = retrieve_from_knowledge_base(user_query)
        subseg.put_metadata('chunks_retrieved', len(chunks))
        subseg.put_metadata('top_score', chunks[0]['score'] if chunks else 0)

    with xray_recorder.in_subsegment('generation') as subseg:
        response = invoke_bedrock(user_query, chunks)
        subseg.put_metadata('output_tokens', response['usage']['outputTokens'])

    return response

# X-Ray Service Map shows retrieval vs. generation time separately,
# telling you exactly which subsystem to optimize.
```
Metadata filtering for pre-query scope reduction:
```python
# Filter by metadata BEFORE vector search — dramatically reduces scan size
retrieval_config = {
    'vectorSearchConfiguration': {
        'numberOfResults': 5,
        'filter': {
            'andAll': [
                {'equals': {'key': 'department', 'value': user_department}},
                {'greaterThan': {'key': 'effective_date', 'value': '2024-01-01'}},
                {'notEquals': {'key': 'status', 'value': 'archived'}}
            ]
        }
    }
}
```
⚠️ Exam Trap: Metadata filtering reduces the scan space for vector search, which speeds up queries. However, an over-aggressive filter can exclude every document, so the retrieval step returns empty context and the FM hallucinates an answer from nothing. Always include a fallback: if filtered retrieval returns 0 results, retry without the filter and log the miss.
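The fallback pattern above can be sketched as a thin wrapper. Here `retrieve_fn(query, metadata_filter=...)` is a hypothetical callable standing in for your Knowledge Base retrieval call, not an AWS SDK function; the widening-and-logging logic is the point.

```python
def retrieve_with_fallback(query, filter_config, retrieve_fn, logger=print):
    """Retrieve with a metadata filter; on zero hits, retry unfiltered.

    retrieve_fn(query, metadata_filter=...) is a hypothetical wrapper
    around your retrieval call -- swap in your own implementation.
    """
    chunks = retrieve_fn(query, metadata_filter=filter_config)
    if chunks:
        return chunks
    # Every document was filtered out: log the miss so the filter schema
    # can be fixed, then widen the scope rather than hand the FM nothing.
    logger(f"filtered retrieval returned 0 results for query: {query!r}")
    return retrieve_fn(query, metadata_filter=None)
```

Logging the miss matters as much as the retry: a pattern of fallbacks usually means the metadata schema (or the values users expect) needs revisiting.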
Reflection Question: CloudWatch shows your RAG application's P99 latency is 14 seconds. X-Ray traces show: retrieval subsegment = 0.8 seconds, generation subsegment = 13.2 seconds. What optimization strategies are available, and which layer should you focus on first?