Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.2. Performance Optimization

💡 First Principle: FM application latency has two distinct components: retrieval latency (the time to find relevant context) and generation latency (the time for the FM to produce output). They require completely different optimization techniques — retrieval is a database problem; generation is a model inference problem. Confusing them leads to optimizing the wrong layer.

The exam tests this distinction by presenting scenarios where candidates must diagnose whether a latency problem is in the retrieval pipeline or the generation pipeline and select the correct optimization. A 12-second P99 latency that's 11 seconds of FM generation cannot be fixed by adding OpenSearch replicas — and vice versa.
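Before choosing a fix, measure each layer separately. A minimal diagnostic sketch, assuming you wrap your own retrieval and FM-call functions (the callables and the `diagnose_latency` helper below are hypothetical, not an AWS API):

```python
import time

def diagnose_latency(retrieve, generate, query):
    """Time retrieval and generation separately to find the slow layer.

    `retrieve` and `generate` are caller-supplied callables (hypothetical
    names): retrieval is the database problem, generation is the model
    inference problem. Optimize whichever layer dominates.
    """
    t0 = time.perf_counter()
    context = retrieve(query)
    retrieval_ms = (time.perf_counter() - t0) * 1000.0

    t1 = time.perf_counter()
    answer = generate(query, context)
    generation_ms = (time.perf_counter() - t1) * 1000.0

    # Point the optimization at the dominant component, not the other one.
    bottleneck = "generation" if generation_ms >= retrieval_ms else "retrieval"
    return {
        "retrieval_ms": retrieval_ms,
        "generation_ms": generation_ms,
        "bottleneck": bottleneck,
        "answer": answer,
    }
```

In the 12-second P99 scenario above, this instrumentation would show `generation_ms` near 11,000, ruling out any OpenSearch-side fix before you spend effort there.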


| Latency Component | Typical Range | Optimization | Wrong Fix |
| --- | --- | --- | --- |
| Retrieval (vector search) | 50–500 ms | OpenSearch provisioned replicas, `ef_search` tuning | Switching to a larger FM |
| FM generation | 500 ms–30 s | Streaming, smaller model, output token ceiling | Adding OpenSearch replicas |
| Lambda cold start | 100 ms–2 s | Provisioned concurrency | Increasing FM `max_tokens` |
| API Gateway overhead | 10–50 ms | HTTP API (not REST API) | Any FM-level change |
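Why does an output token ceiling cut generation latency? A rough back-of-envelope model, assuming generation time is time-to-first-token plus decode time per output token (the `ttft_ms` and `tokens_per_sec` defaults below are illustrative assumptions, not Bedrock quotas):

```python
def generation_latency_ms(output_tokens, ttft_ms=400.0, tokens_per_sec=40.0):
    """Estimate FM generation latency as time-to-first-token plus decode time.

    Defaults are illustrative only; real values vary by model and load.
    A max_tokens ceiling bounds worst-case total latency, while streaming
    bounds *perceived* latency at roughly ttft_ms.
    """
    return ttft_ms + (output_tokens / tokens_per_sec) * 1000.0
```

Under these assumed numbers, capping output at 500 tokens instead of 2,000 drops the estimate from about 50.4 s to about 12.9 s, which is why the ceiling appears as a generation-side fix in the table and why no amount of OpenSearch tuning would help.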

Common Misconception: Adding more context always improves answer quality and is therefore worth the latency cost. In practice, diminishing returns kick in quickly: adding context beyond the ten most relevant chunks rarely improves answer quality, while generation latency grows with every extra input token. Always measure the quality-latency trade-off at different k values rather than assuming bigger context is better.
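Measuring that trade-off can be as simple as a sweep over k. A minimal sketch, assuming you supply your own quality-scoring and latency-measurement hooks (the `sweep_k` helper and both hook names are hypothetical):

```python
def sweep_k(chunks_ranked, answer_quality, latency_ms, k_values=(2, 5, 10, 20)):
    """Evaluate the quality-latency trade-off at several retrieval depths.

    chunks_ranked: chunks sorted by relevance, most relevant first.
    answer_quality / latency_ms: caller-supplied evaluation hooks
    (hypothetical); e.g. an LLM-as-judge score and a wall-clock timer.
    """
    results = []
    for k in k_values:
        ctx = chunks_ranked[:k]  # take only the top-k chunks as context
        results.append({
            "k": k,
            "quality": answer_quality(ctx),
            "latency_ms": latency_ms(ctx),
        })
    return results
```

Plotting the results typically shows quality plateauing around k = 10 while latency keeps climbing, which is the measurement this misconception callout asks you to make before raising k.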

Written by Alvin Varughese, Founder. 15 professional certifications.