Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.1.1. Prompt Caching and Token Optimization

💡 First Principle: Prompt caching eliminates the cost of re-processing static content — system prompts, few-shot examples, and large knowledge base extracts that are identical across many requests — by computing their token representations once and reusing the cached result for subsequent calls.

Amazon Bedrock prompt caching mechanics:

The Converse API supports prompt caching for request prefixes that are identical across calls. When the prefix up to a cachePoint marker matches a recently cached request, Bedrock reads those tokens from cache instead of re-processing them (cache entries expire after a short idle window, roughly five minutes):

# Prompt caching via cachePoint markers in the Converse API
import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

response = bedrock_runtime.converse(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',  # must be a model that supports caching
    system=[
        {'text': LONG_SYSTEM_PROMPT},           # 2,000 tokens — same every request
        {'cachePoint': {'type': 'default'}}      # Mark end of cacheable prefix
    ],
    messages=[
        {
            'role': 'user',
            'content': [
                {'text': LARGE_DOCUMENT_CONTEXT},  # 8,000 tokens of retrieved docs
                {'cachePoint': {'type': 'default'}},
                {'text': user_query}               # Only this varies per request
            ]
        }
    ]
)
# First call: pays the full input token cost (cache writes carry a small premium)
# Later calls reusing the same system prompt + docs: the cached prefix is billed
# at a deep cache-read discount, so you pay full price only for user_query tokens
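To confirm that caching is actually working, inspect the usage block the Converse API returns. A minimal helper (my own, not from the Bedrock SDK), assuming `inputTokens` excludes cached tokens and cache reads are reported under `cacheReadInputTokens`:

```python
def cache_hit_ratio(response):
    """Fraction of input tokens served from the prompt cache.

    Assumes the Converse API usage block reports uncached tokens in
    'inputTokens' and cached tokens separately in 'cacheReadInputTokens'.
    """
    usage = response['usage']
    cached = usage.get('cacheReadInputTokens', 0)
    total = usage['inputTokens'] + cached
    return cached / total if total else 0.0
```

A ratio near zero on repeat requests is the telltale sign of accidental variable content in the "static" prefix.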
Token reduction techniques — ordered by impact:

| Technique | Token Reduction | Implementation Effort | Risk |
|---|---|---|---|
| Prompt caching | 50–90% of static tokens | Low (add cachePoint) | Minimal |
| System prompt compression | 20–40% | Medium (rewrite prompts) | Output format drift |
| Context window trimming | 30–60% | Medium (retrieval k tuning) | Reduced answer quality |
| Summarize conversation history | 40–70% of history tokens | Medium (extra FM call) | History fidelity loss |
| Model downgrade for simple queries | 50–80% cost per token | Medium (routing logic) | Capability regression |
| Output length constraints | 20–50% of output tokens | Low (max_tokens param) | Truncated responses |
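The "model downgrade" row deserves a concrete sketch. One common pattern is a pre-flight router that sends short, simple queries to a cheaper model and reserves the larger model for everything else. The heuristic, keyword list, and model IDs below are illustrative assumptions, not a prescription from the Bedrock documentation:

```python
CHEAP_MODEL = 'anthropic.claude-3-haiku-20240307-v1:0'     # assumed cheap tier
LARGE_MODEL = 'anthropic.claude-3-5-sonnet-20241022-v2:0'  # assumed capable tier

def route_model(query: str) -> str:
    """Pick a model ID based on a crude query-complexity heuristic.

    Short queries without analytical keywords go to the cheap model;
    everything else goes to the larger one.
    """
    complex_markers = ('analyze', 'compare', 'summarize', 'explain why')
    is_simple = (len(query.split()) < 20
                 and not any(m in query.lower() for m in complex_markers))
    return CHEAP_MODEL if is_simple else LARGE_MODEL
```

In production you would tune the heuristic (or replace it with a small classifier) and monitor for the capability-regression risk noted in the table.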
Measuring token efficiency:

import boto3

cloudwatch = boto3.client('cloudwatch')

def log_token_efficiency(response, session_id):
    usage = response['usage']
    # CloudWatch expects Dimensions inside each metric datum,
    # not as a top-level put_metric_data argument
    dimensions = [{'Name': 'SessionId', 'Value': session_id}]
    cloudwatch.put_metric_data(
        Namespace='GenAI/Cost',
        MetricData=[
            {'MetricName': 'InputTokens', 'Value': usage['inputTokens'],
             'Unit': 'Count', 'Dimensions': dimensions},
            {'MetricName': 'OutputTokens', 'Value': usage['outputTokens'],
             'Unit': 'Count', 'Dimensions': dimensions},
            {'MetricName': 'CacheReadTokens',
             'Value': usage.get('cacheReadInputTokens', 0),
             'Unit': 'Count', 'Dimensions': dimensions},
            {'MetricName': 'CostSavingsFromCache',
             # Approximate $ saved: cached tokens that would otherwise have
             # been billed at the full input rate (~$3/M here; adjust for
             # your model's actual pricing and cache-read discount)
             'Value': usage.get('cacheReadInputTokens', 0) * 0.000003,
             'Unit': 'None', 'Dimensions': dimensions}
        ]
    )

⚠️ Exam Trap: Prompt caching is only effective when the cached prefix is identical across requests. If you include the current timestamp, a session ID, or any variable content in the system prompt, the cache never hits. Audit your system prompts for accidental variable content before expecting caching savings.

Reflection Question: Your system prompt is 3,000 tokens and includes a daily-refreshed list of product prices (updated every morning). You process 500,000 queries per day. Calculate the daily cache-miss rate and the token waste from the daily price list update, then propose an architectural change that preserves price accuracy while maximizing cache hit rate.

Written by Alvin Varughese, Founder