6.2.2. Generation Performance
💡 First Principle: FM generation latency is proportional to output token count, not input token count. A 100-token query with a 2,000-token response takes much longer than a 2,000-token query with a 100-token response. Generation optimization therefore focuses on reducing output token count and making each generated token deliver maximum value.
Generation latency optimization strategies:
| Strategy | Impact | How |
|---|---|---|
| Streaming | Reduces perceived latency to <500ms first token | InvokeModelWithResponseStream |
| Output length constraints | Proportional reduction | Set max_tokens to realistic ceiling; instruct model to be concise |
| Model selection | 3–10x difference between tiers | Claude 3 Haiku vs. Sonnet: Haiku ~3x faster |
| Provisioned throughput | Eliminates cold-start variation | Reserved capacity = consistent latency floor |
| Reduced input tokens | Frees capacity for faster generation | Shorter context = more headroom per request |
| Parallel invocations | For decomposed queries | Use asyncio or Lambda concurrency for independent sub-queries |
Parallel FM invocations with asyncio (the `invoke_async` helper is an illustrative sketch using the Anthropic Messages body format; adjust for other model families):

```python
import asyncio, json
import aioboto3

async def invoke_async(client, query: str, model_id: str) -> str:
    """One async Bedrock invocation; Anthropic Messages body shown (other models differ)."""
    body = json.dumps({"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512,
                       "messages": [{"role": "user", "content": query}]})
    response = await client.invoke_model(modelId=model_id, body=body)
    return json.loads(await response["body"].read())["content"][0]["text"]

async def parallel_invoke(queries: list[str], model_id: str) -> list[str | None]:
    """Invoke Bedrock in parallel for independent sub-queries; failures return None."""
    async with aioboto3.Session().client("bedrock-runtime") as client:
        tasks = [invoke_async(client, q, model_id) for q in queries]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        return [None if isinstance(r, Exception) else r for r in results]

# For a query decomposed into 3 independent sub-queries:
# Sequential: 3 × 4s = 12s total
# Parallel: max(4s, 4s, 4s) = 4s total — 3x speedup
```
Speculative decoding and model distillation (SageMaker-hosted models): For custom models deployed on SageMaker, speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies with the full model — reducing generation time while maintaining quality. This is a SageMaker-level optimization, not available directly on Bedrock managed models.
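To make the draft-then-verify idea concrete, here is a toy, model-agnostic sketch of the speculative decoding loop. The `draft` and `target` callables stand in for the small and full models (both are assumptions for illustration); a real implementation verifies all proposed tokens in a single batched forward pass:

```python
from typing import Callable

def speculative_decode(
    draft: Callable[[list[str]], str],   # cheap model: predicts next token
    target: Callable[[list[str]], str],  # full model: ground-truth next token
    tokens: list[str],
    k: int = 3,
    max_new: int = 8,
) -> list[str]:
    """Draft proposes k tokens ahead; target keeps the longest verified prefix
    and supplies a correction at the first mismatch. Each round emits at least
    one target-quality token, so output quality is unchanged."""
    generated = 0
    while generated < max_new:
        # 1. Draft model speculates k tokens cheaply.
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposal (batched in a real system).
        accepted, correction = [], None
        for t in proposal:
            expected = target(tokens + accepted)
            if t == expected:
                accepted.append(t)
            else:
                correction = expected  # target's own token replaces the miss
                break
        tokens.extend(accepted)
        generated += len(accepted)
        if correction is not None:
            tokens.append(correction)
            generated += 1
    return tokens
```

When the draft model is usually right, each full-model verification pass yields several tokens instead of one, which is where the speedup comes from.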
⚠️ Exam Trap: Reducing max_tokens does not reduce latency if the FM stops generating naturally before that limit. max_tokens is a ceiling, not a target. If your FM typically generates 300-token responses, setting max_tokens=1000 versus max_tokens=350 makes no practical difference in most calls — but setting max_tokens=100 would truncate responses. The right approach is to set max_tokens to a realistic maximum plus roughly 20% headroom.
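The "realistic maximum plus 20% headroom" rule is simple arithmetic; a minimal helper (the function name and default are assumptions, not an AWS API) makes it explicit:

```python
def max_tokens_ceiling(typical_response_tokens: int, headroom: float = 0.20) -> int:
    """Ceiling for max_tokens: observed typical response length plus headroom.

    High enough to avoid truncation, low enough to cap runaway generations.
    """
    return int(typical_response_tokens * (1 + headroom))

# A model that typically produces 300-token responses:
# max_tokens_ceiling(300) -> 360
```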
Reflection Question: Your FM chat interface shows users a blank screen for 8–12 seconds before any response appears. Users complain the app "feels broken." The generation quality is fine — only the perceived responsiveness is the problem. What single architectural change addresses this with the least engineering effort?