6.2. Performance Optimization
💡 First Principle: FM application latency has two distinct components: retrieval latency (the time to find relevant context) and generation latency (the time for the FM to produce output). They require completely different optimization techniques — retrieval is a database problem; generation is a model inference problem. Confusing them leads to optimizing the wrong layer.
The exam tests this distinction by presenting scenarios where candidates must diagnose whether a latency problem is in the retrieval pipeline or the generation pipeline and select the correct optimization. A 12-second P99 latency that's 11 seconds of FM generation cannot be fixed by adding OpenSearch replicas — and vice versa.
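The diagnosis step can be sketched as a simple timing harness that measures each pipeline stage separately. This is a minimal illustration: `retrieve` and `generate` are stand-ins for the real calls (e.g., an OpenSearch k-NN query and a Bedrock `InvokeModel` request), with sleeps simulating their latency.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start

# Stand-ins for the real calls (OpenSearch vector search, Bedrock generation).
def retrieve(query):
    time.sleep(0.05)   # simulate ~50 ms vector search
    return ["chunk-1", "chunk-2"]

def generate(query, chunks):
    time.sleep(0.5)    # simulate ~500 ms FM generation
    return "answer"

with timed("retrieval"):
    chunks = retrieve("what is our refund policy?")
with timed("generation"):
    answer = generate("what is our refund policy?", chunks)

dominant = max(timings, key=timings.get)
print(f"{timings} -> optimize the {dominant} layer first")
```

Instrumenting the two stages independently is what lets you rule out the "wrong fix": if generation dominates, no amount of retrieval tuning will move the P99.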
⚠️ Match the fix to the failing layer — the Wrong Fix column shows optimizations that target the wrong component and leave the latency problem untouched:
| Latency Component | Typical Range | Optimization | Wrong Fix |
|---|---|---|---|
| Retrieval (vector search) | 50–500ms | OpenSearch provisioned replicas, ef_search tuning | Switching to larger FM |
| FM generation | 500ms–30s | Streaming, smaller model, output token ceiling | Adding OpenSearch replicas |
| Lambda cold start | 100ms–2s | Provisioned concurrency | Increasing FM max_tokens |
| API Gateway overhead | 10–50ms | HTTP API (not REST API) | Any FM-level change |
Common Misconception: Adding more context always improves answer quality and is worth the latency cost. In reality, diminishing returns kick in quickly — in practice, adding context beyond roughly the 10 most relevant chunks rarely improves answer quality, while generation latency keeps growing with every added input token. Always measure the quality-latency trade-off at different k values.