6.1. Cost Optimization Strategies
💡 First Principle: FM cost is a function of tokens consumed, not time elapsed or requests made. Every optimization strategy ultimately reduces either the number of input tokens, the number of output tokens, or the price per token. Understanding which of these three levers applies to your specific cost driver determines which optimization technique to implement.
A GenAI application with a 2,000-token system prompt, 8,000-token retrieved context, and 500-token response pays for 10,500 tokens per query. At 1 million queries per day on Claude 3 Sonnet on-demand pricing (~$0.003/1K input tokens + $0.015/1K output tokens), prompt caching alone saves roughly $5,400/day, because the 2,000-token system prompt is billed at a cached-read discount (typically ~90% off the input rate) on repeated calls. That saving often exceeds what switching from on-demand to provisioned throughput yields.
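A back-of-envelope check of this arithmetic, as a minimal sketch. The prices are the Claude 3 Sonnet on-demand rates quoted above; the 90% cached-read discount is an assumption, so verify it against current pricing for your model:

```python
# Cost model for the example workload: 2,000-token system prompt,
# 8,000-token retrieved context, 500-token response, 1M queries/day.
INPUT_PRICE_PER_1K = 0.003   # $ per 1K input tokens (Claude 3 Sonnet on-demand)
OUTPUT_PRICE_PER_1K = 0.015  # $ per 1K output tokens
QUERIES_PER_DAY = 1_000_000
CACHED_READ_DISCOUNT = 0.90  # assumed prompt-caching discount; check your provider

def query_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one query at on-demand rates."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K + \
           (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

# Baseline: full 10,500 tokens billed at standard rates on every call.
baseline = query_cost(2000 + 8000, 500) * QUERIES_PER_DAY

# With caching: the 2,000-token system prompt is billed at the
# discounted cached-read rate on repeated calls.
cached_prompt = (2000 / 1000) * INPUT_PRICE_PER_1K * (1 - CACHED_READ_DISCOUNT)
with_caching = (query_cost(8000, 500) + cached_prompt) * QUERIES_PER_DAY

daily_saving = baseline - with_caching
print(f"baseline ${baseline:,.0f}/day, saving ${daily_saving:,.0f}/day")
```

The discount parameter is the lever to adjust: a full elimination of the system prompt cost would save $6,000/day, a 90% discount saves $5,400/day.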
| Cost Lever | Technique | Typical Saving | When It Applies |
|---|---|---|---|
| Reduce input tokens | Prompt caching for static system prompts | 50–90% on system prompt cost | Repeated static prefixes |
| Reduce input tokens | Tighter RAG (k=3 not k=10) | 20–40% on context cost | Oversized retrieval |
| Reduce output tokens | max_tokens ceiling + output format spec | 10–30% | Unbounded generation |
| Reduce price/token | Smaller model for simple tasks (routing) | 60–90% | Mixed complexity workloads |
| Defer non-urgent work | Bedrock Batch Inference (~50% discount) | 50% | Non-real-time processing |
| Cache semantic results | ElastiCache semantic cache | 30–70% on repeated queries | Repeated similar queries |
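The "smaller model for simple tasks" row above can be sketched as a router. The word-count heuristic and the choice of model IDs are illustrative assumptions, not a recommendation; production routers typically use a lightweight classifier instead:

```python
# Minimal sketch of complexity-based model routing for Amazon Bedrock.
# Both model IDs are placeholders for whatever is enabled in your account.
CHEAP_MODEL = "anthropic.claude-3-haiku-20240307-v1:0"
CAPABLE_MODEL = "anthropic.claude-3-sonnet-20240229-v1:0"

def pick_model(prompt: str) -> str:
    """Route short, simple requests to the cheaper model."""
    # Hypothetical heuristic: short prompts without analysis keywords
    # are treated as simple. Tune against your own workload.
    is_simple = len(prompt.split()) < 50 and "analyze" not in prompt.lower()
    return CHEAP_MODEL if is_simple else CAPABLE_MODEL
```

The saving comes entirely from the price-per-token gap between the two models, so the router only pays off on genuinely mixed-complexity workloads.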
Common Misconception: Larger context windows always scale cost linearly. Many models use tiered pricing in which the per-token rate changes once the prompt exceeds a threshold (the over-threshold rate may be higher or lower, depending on the provider), and some providers implement prompt caching at the API level. Always check current model-specific pricing before assuming linear cost scaling with context size.
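Tiered pricing can be modeled with a simple two-rate function. The threshold and both rates below are hypothetical placeholders, not any provider's published numbers; substitute the values from the current pricing page:

```python
def tiered_input_cost(tokens: int,
                      base_rate_per_1k: float = 0.003,   # hypothetical rate below threshold
                      tier_rate_per_1k: float = 0.0015,  # hypothetical rate above threshold
                      threshold: int = 128_000) -> float:
    """Dollar cost of an input prompt under two-tier per-token pricing."""
    below = min(tokens, threshold)
    above = max(tokens - threshold, 0)
    return below / 1000 * base_rate_per_1k + above / 1000 * tier_rate_per_1k
```

For a 200,000-token prompt with these placeholder rates, the first 128K tokens bill at the base rate and the remaining 72K at the tier rate, so the total differs from a naive linear projection.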