Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.2.2. Generation Performance

💡 First Principle: FM generation latency is proportional to output token count, not input token count. A 100-token query with a 2,000-token response takes much longer than a 2,000-token query with a 100-token response. Generation optimization therefore focuses on reducing output token count and maximizing the value each generated token delivers.
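As a rough sketch of why output length dominates, total latency can be modeled as time-to-first-token plus output length divided by generation speed. The constants below are illustrative assumptions, not measured Bedrock figures:

```python
def estimated_latency_s(output_tokens: int, ttft_s: float = 0.5,
                        tok_per_s: float = 50.0) -> float:
    """Latency ~ time-to-first-token + output_tokens / generation speed.

    ttft_s and tok_per_s are illustrative assumptions, not measured values.
    Note that input token count does not appear in the formula at all.
    """
    return ttft_s + output_tokens / tok_per_s

# Output length dominates end-to-end latency:
print(estimated_latency_s(2000))  # 40.5 s for a 2,000-token response
print(estimated_latency_s(100))   # 2.5 s for a 100-token response
```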

Generation latency optimization strategies:
| Strategy | Impact | How |
| --- | --- | --- |
| Streaming | Reduces perceived latency to <500 ms first token | InvokeModelWithResponseStream |
| Output length constraints | Proportional reduction | Set max_tokens to a realistic ceiling; instruct the model to be concise |
| Model selection | 3–10x difference between tiers | Claude 3 Haiku vs. Sonnet: Haiku ~3x faster |
| Provisioned throughput | Eliminates cold-start variation | Reserved capacity = consistent latency floor |
| Reduced input tokens | Frees capacity for faster generation | Shorter context = more headroom per request |
| Parallel invocations | For decomposed queries | Use asyncio or Lambda concurrency for independent sub-queries |
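To illustrate the streaming strategy, here is a minimal sketch using boto3's invoke_model_with_response_stream. The request and event shapes assume an Anthropic Claude model on Bedrock, and extract_delta_text is a small helper introduced here:

```python
import json

def extract_delta_text(chunk: dict) -> str:
    """Pull incremental text out of one streamed Claude event, if any."""
    if chunk.get("type") == "content_block_delta":
        return chunk["delta"].get("text", "")
    return ""

def stream_response(prompt: str, model_id: str) -> str:
    """Print tokens as they arrive so users see output within ~500 ms."""
    import boto3  # imported here so the parser above works without the SDK
    client = boto3.client("bedrock-runtime")
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": prompt}],
    })
    resp = client.invoke_model_with_response_stream(modelId=model_id, body=body)
    parts = []
    for event in resp["body"]:
        text = extract_delta_text(json.loads(event["chunk"]["bytes"]))
        if text:
            print(text, end="", flush=True)  # user sees the first token immediately
            parts.append(text)
    return "".join(parts)
```

Total generation time is unchanged; only perceived latency improves, because the user starts reading while the rest of the response is still being generated.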
Parallel FM invocations with asyncio:
import asyncio, json, aioboto3

async def invoke_async(client, query: str, model_id: str) -> str:
    # Request/response shape assumes an Anthropic Claude model on Bedrock.
    body = json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user", "content": query}],
    })
    resp = await client.invoke_model(modelId=model_id, body=body)
    payload = json.loads(await resp["body"].read())
    return payload["content"][0]["text"]

async def parallel_invoke(queries: list[str], model_id: str) -> list[str | None]:
    """Invoke Bedrock in parallel for independent sub-queries."""
    async with aioboto3.Session().client('bedrock-runtime') as client:
        tasks = [invoke_async(client, q, model_id) for q in queries]
        results = await asyncio.gather(*tasks, return_exceptions=True)
    # Failed sub-queries become None rather than failing the whole batch
    return [None if isinstance(r, Exception) else r for r in results]

# For a query decomposed into 3 independent sub-queries:
# Sequential: 3 × 4s = 12s total
# Parallel: max(4s, 4s, 4s) = 4s total — 3x speedup

Speculative decoding and model distillation (SageMaker-hosted models): For custom models deployed on SageMaker, speculative decoding uses a small draft model to predict multiple tokens ahead, then verifies with the full model — reducing generation time while maintaining quality. This is a SageMaker-level optimization, not available directly on Bedrock managed models.
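The speculative-decoding idea can be sketched in pure Python. The target and draft callables below are hypothetical stand-ins for the real models: the cheap draft proposes k tokens ahead, and the expensive target verifies them, so accepted tokens cost one target pass instead of k.

```python
from typing import Callable, List

NextToken = Callable[[List[str]], str]  # next token given the current prefix

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[str], k: int, max_new: int) -> List[str]:
    """Draft proposes k tokens ahead; target keeps the longest agreeing
    prefix plus its own correction at the first mismatch."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        # 1) Cheap draft model proposes k tokens autoregressively.
        ctx, proposed = list(out), []
        for _ in range(k):
            proposed.append(draft(ctx))
            ctx.append(proposed[-1])
        # 2) Expensive target verifies all k positions (a single parallel
        #    pass in a real system; a loop here for clarity).
        ctx = list(out)
        for tok in proposed:
            expected = target(ctx)
            out.append(expected)   # output is always target-consistent
            ctx.append(expected)
            if tok != expected:    # draft diverged: stop accepting
                break
    return out[: len(prompt) + max_new]
```

Because every accepted token is checked against the target, the output matches plain greedy decoding from the target model; the speedup comes from verifying k positions in one pass whenever the draft is usually right.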

⚠️ Exam Trap: Reducing max_tokens does not reduce latency if the FM stops generating naturally before that limit. max_tokens is a ceiling, not a target. If your FM typically generates 300-token responses, setting max_tokens=1000 versus max_tokens=350 makes no practical difference in most calls — but setting max_tokens=100 would truncate responses. The right approach is max_tokens set to a realistic maximum plus 20% headroom.
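The sizing rule from the exam trap can be expressed as a tiny helper (hypothetical, not a Bedrock API), fed by the typical output length you observe in production:

```python
import math

def max_tokens_ceiling(typical_output_tokens: int, headroom: float = 0.20) -> int:
    """Realistic max_tokens: typical output length plus ~20% headroom.

    Large enough to avoid truncating normal responses, small enough to
    bound worst-case generation latency.
    """
    return math.ceil(typical_output_tokens * (1 + headroom))

print(max_tokens_ceiling(300))  # 360: ceiling for ~300-token responses
```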

Reflection Question: Your FM chat interface shows users a blank screen for 8–12 seconds before any response appears. Users complain the app "feels broken." The generation quality is fine — only the perceived responsiveness is the problem. What single architectural change addresses this with the least engineering effort?

Written by Alvin Varughese
Founder, 15 professional certifications