6.1.1. Prompt Caching and Token Optimization
💡 First Principle: Prompt caching eliminates the cost of re-processing static content — system prompts, few-shot examples, and large knowledge base extracts that are identical across many requests — by computing their token representations once and reusing the cached result for subsequent calls.
Amazon Bedrock prompt caching mechanics:
The Bedrock Converse API supports prompt caching for content prefixes that are identical across calls. You mark the end of each cacheable prefix with an explicit `cachePoint` block; when the same system prompt (or document context) appears before that marker in a subsequent request, Bedrock reuses the cached token representations instead of re-processing them:
```python
import boto3

bedrock_runtime = boto3.client('bedrock-runtime')

# Prompt caching via cachePoint markers in the Converse API.
# Note: prompt caching is only supported on specific models (e.g. Claude 3.5
# Sonnet v2, Claude 3.7 Sonnet, Claude 3.5 Haiku, Amazon Nova) -- Claude 3
# Sonnet does not support it.
response = bedrock_runtime.converse(
    modelId='anthropic.claude-3-5-sonnet-20241022-v2:0',
    system=[
        {'text': LONG_SYSTEM_PROMPT},        # 2,000 tokens, same every request
        {'cachePoint': {'type': 'default'}}  # Mark end of cacheable prefix
    ],
    messages=[
        {
            'role': 'user',
            'content': [
                {'text': LARGE_DOCUMENT_CONTEXT},    # 8,000 tokens of retrieved docs
                {'cachePoint': {'type': 'default'}},
                {'text': user_query}                 # Only this varies per request
            ]
        }
    ]
)
# First call: pays full input token cost (cache writes are billed at a premium)
# Subsequent calls with the same system prompt + docs (within the cache TTL):
# pay discounted cache-read rates for the prefix and full price only for
# the user_query tokens
```
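To see what caching is worth, the per-request cost can be sketched as a simple comparison. The prices below are illustrative assumptions (a $3-per-million-token input rate and a 90% cache-read discount), not current Bedrock rates:

```python
def request_cost(prefix_tokens, variable_tokens, cached,
                 price_per_token=0.000003,    # assumed $3 / 1M input tokens
                 cache_read_discount=0.9):    # assumed 90% discount on cache reads
    """Estimate input-token cost for one request, with or without a cache hit."""
    if cached:
        prefix_cost = prefix_tokens * price_per_token * (1 - cache_read_discount)
    else:
        prefix_cost = prefix_tokens * price_per_token
    return prefix_cost + variable_tokens * price_per_token

# 10,000-token static prefix (system prompt + docs), 50-token user query
cold = request_cost(10_000, 50, cached=False)  # first call, cache miss
warm = request_cost(10_000, 50, cached=True)   # subsequent calls, cache hit
# warm is roughly an order of magnitude cheaper than cold
```

Because the static prefix dominates the token count in RAG-style requests, the discount on the prefix dwarfs the full-price cost of the short user query.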
Token reduction techniques — ordered by impact:
| Technique | Token Reduction | Implementation Effort | Risk |
|---|---|---|---|
| Prompt caching | 50–90% of static tokens | Low (add cachePoint) | Minimal |
| System prompt compression | 20–40% | Medium (rewrite prompts) | Output format drift |
| Context window trimming | 30–60% | Medium (retrieval k tuning) | Reduced answer quality |
| Summarize conversation history | 40–70% of history tokens | Medium (extra FM call) | History fidelity loss |
| Model downgrade for simple queries | 50–80% cost per token | Medium (routing logic) | Capability regression |
| Output length constraints | 20–50% of output tokens | Low (max_tokens param) | Truncated responses |
Measuring token efficiency:
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

def log_token_efficiency(response, session_id):
    """Publish per-call token usage and estimated cache savings to CloudWatch."""
    usage = response['usage']
    cache_read = usage.get('cacheReadInputTokens', 0)
    # Dimensions belong inside each MetricDatum; put_metric_data has no
    # top-level Dimensions parameter
    dimensions = [{'Name': 'SessionId', 'Value': session_id}]
    cloudwatch.put_metric_data(
        Namespace='GenAI/Cost',
        MetricData=[
            {'MetricName': 'InputTokens', 'Value': usage['inputTokens'],
             'Unit': 'Count', 'Dimensions': dimensions},
            {'MetricName': 'OutputTokens', 'Value': usage['outputTokens'],
             'Unit': 'Count', 'Dimensions': dimensions},
            {'MetricName': 'CacheReadTokens', 'Value': cache_read,
             'Unit': 'Count', 'Dimensions': dimensions},
            {'MetricName': 'CostSavingsFromCache',
             # Assumes $3 / 1M input tokens and a 90% cache-read discount
             'Value': cache_read * 0.000003 * 0.9,
             'Unit': 'None', 'Dimensions': dimensions}
        ]
    )
```
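The logged metrics roll up naturally into a cache hit rate: the fraction of input tokens served from cache. A sketch, assuming `inputTokens` in the usage block counts only non-cached tokens (verify against your API version, since usage accounting differs across models):

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens that were cache reads for one call."""
    cache_read = usage.get('cacheReadInputTokens', 0)
    total = usage.get('inputTokens', 0) + cache_read
    return cache_read / total if total else 0.0

# Example usage block: 50 fresh input tokens, 10,000 tokens read from cache
rate = cache_hit_rate({'inputTokens': 50, 'cacheReadInputTokens': 10_000})
# rate is just over 0.99 -- nearly all input tokens came from cache
```

Tracking this ratio per endpoint is the fastest way to spot a system prompt that was accidentally made variable, since the hit rate collapses toward zero the moment the prefix stops matching.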
⚠️ Exam Trap: Prompt caching is only effective when the cached prefix is byte-for-byte identical across requests and requests arrive within the cache TTL (typically about five minutes). If you include the current timestamp, a session ID, or any variable content in the system prompt, the cache never hits. Audit your system prompts for accidental variable content before expecting caching savings.
Reflection Question: Your system prompt is 3,000 tokens and includes a daily-refreshed list of product prices (updated every morning). You process 500,000 queries per day. Calculate the daily cache-miss rate and the token waste from the daily price list update, then propose an architectural change that preserves price accuracy while maximizing cache hit rate.
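One way to set up the arithmetic (a sketch, assuming uniform traffic keeps the cache warm within the TTL, a single shared prefix, and exactly one price-list refresh per day):

```python
QUERIES_PER_DAY = 500_000
PREFIX_TOKENS = 3_000   # system prompt including the daily price list

# At ~5.8 queries/second the cache never expires between requests, so the
# only forced miss is the first request after the morning price refresh.
daily_misses = 1
miss_rate = daily_misses / QUERIES_PER_DAY      # 1 in 500,000 requests
wasted_tokens = daily_misses * PREFIX_TOKENS    # prefix re-processed once/day

# Architectural direction: move the volatile price list out of the cached
# prefix -- e.g. keep the static instructions behind one cachePoint and
# inject current prices via retrieval into the (uncached) user turn, so a
# price update never invalidates the large static prefix.
```

The numbers show the daily refresh itself is cheap under these assumptions; the real risk is any *intra-day* variability (per-request timestamps, rotating prices) that would turn every request into a miss.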