Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.2.1. Retrieval Performance

💡 First Principle: Vector search latency is dominated by index scan time, which scales with corpus size and the k-NN algorithm parameters. The optimization hierarchy for retrieval is: algorithm selection → index parameters → hardware scaling → query optimization → caching — in that order, because earlier optimizations have higher leverage.

Retrieval latency optimization levers:
| Lever | Latency Impact | Trade-off | Implementation |
|---|---|---|---|
| HNSW `ef_search` parameter | 30–60% reduction | Lower = faster but worse recall | Tune per use case; start at 512, reduce until recall degrades |
| Reduce retrieval k | Linear with k | Fewer chunks = less context | Test quality impact at k = 3, 5, 10 |
| Metadata pre-filter | 40–80% reduction | Requires well-designed metadata schema | Add department/date filters before vector search |
| Semantic cache | 95%+ reduction for cache hits | Cache staleness risk | Cache FAQ-type queries |
| OpenSearch replicas | 30–50% under read load | Cost increase | Add replicas when CPU > 70% |
| Multi-AZ OpenSearch | Availability, not latency | Cost | For HA, not performance |
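The semantic cache lever can be sketched in a few lines: embed each answered query, and if a new query's embedding is close enough (cosine similarity above a threshold) to a cached one, return the cached answer instead of running retrieval and generation at all. The class below is a minimal illustration, not a Bedrock feature — the `SemanticCache` name, the 0.95 default threshold, and the linear scan are assumptions for clarity; a production cache would use a vector index and TTL-based eviction to bound the staleness risk noted in the table.

```python
import math

class SemanticCache:
    """Toy semantic cache: serve a stored answer when a new query's
    embedding is close enough (cosine similarity) to a cached one."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, embedding):
        # Linear scan for illustration; use a vector index at scale.
        best_answer, best_sim = None, 0.0
        for cached_emb, answer in self.entries:
            sim = self._cosine(embedding, cached_emb)
            if sim > best_sim:
                best_answer, best_sim = answer, sim
        return best_answer if best_sim >= self.threshold else None

    def put(self, embedding, answer):
        self.entries.append((embedding, answer))
```

On a cache hit the query never reaches OpenSearch or the FM, which is why the table credits this lever with 95%+ latency reduction for repeated FAQ-type questions.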
Measuring retrieval vs. generation latency with X-Ray:
```python
from aws_xray_sdk.core import xray_recorder

@xray_recorder.capture('rag_pipeline')
def handle_query(user_query):
    # Time the retrieval step separately from generation
    with xray_recorder.in_subsegment('retrieval') as subseg:
        chunks = retrieve_from_knowledge_base(user_query)
        subseg.put_metadata('chunks_retrieved', len(chunks))
        subseg.put_metadata('top_score', chunks[0]['score'] if chunks else 0)

    with xray_recorder.in_subsegment('generation') as subseg:
        response = invoke_bedrock(user_query, chunks)
        subseg.put_metadata('output_tokens', response['usage']['outputTokens'])

    return response

# The X-Ray Service Map shows retrieval vs. generation time separately,
# telling you exactly which subsystem to optimize.
```
Metadata filtering for pre-query scope reduction:
```python
# Filter by metadata BEFORE vector search — dramatically reduces scan size
retrieval_config = {
    'vectorSearchConfiguration': {
        'numberOfResults': 5,
        'filter': {
            'andAll': [
                {'equals': {'key': 'department', 'value': user_department}},
                {'greaterThan': {'key': 'effective_date', 'value': '2024-01-01'}},
                {'notEquals': {'key': 'status', 'value': 'archived'}}
            ]
        }
    }
}
```

⚠️ Exam Trap: Metadata filtering shrinks the scan space for vector search, which speeds up queries. However, over-aggressive filters can exclude every document, so the retrieval step returns empty context and the FM answers without grounding — a common hallucination path. Always include a fallback: if filtered retrieval returns 0 results, retry without the filter and log the miss.
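The fallback pattern can be implemented as a thin wrapper around the retrieval call. This is a sketch: `retrieve_fn` is a placeholder for whatever function issues the Knowledge Base query, and the config shape follows the metadata-filter example above.

```python
import copy

def retrieve_with_fallback(query, retrieval_config, retrieve_fn):
    """Run filtered retrieval; if the filter excludes everything,
    retry once without the filter and log the miss.
    retrieve_fn(query, config) -> list of chunks (placeholder)."""
    chunks = retrieve_fn(query, retrieval_config)
    if chunks:
        return chunks
    # All documents were filtered out — retry without the metadata
    # filter so the FM still receives grounding context.
    unfiltered = copy.deepcopy(retrieval_config)
    unfiltered.get('vectorSearchConfiguration', {}).pop('filter', None)
    print(f"filter-miss: 0 results for {query!r}; retrying unfiltered")
    return retrieve_fn(query, unfiltered)
```

Deep-copying before popping the filter keeps the original config intact for subsequent queries; logging the miss gives you the signal needed to fix an over-aggressive metadata schema.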

Reflection Question: CloudWatch shows your RAG application's P99 latency is 14 seconds. X-Ray traces show: retrieval subsegment = 0.8 seconds, generation subsegment = 13.2 seconds. What optimization strategies are available, and which layer should you focus on first?

Written by Alvin Varughese
Founder · 15 professional certifications