4.3.4. Intelligent Model Routing
💡 First Principle: Not every query deserves the most capable (and most expensive) model. Intelligent routing sends simple queries to cheap, fast models and complex queries to capable ones — often cutting costs by 60–90% without degrading quality, because the majority of requests are inherently simple.
Routing strategies and their implementations:
| Strategy | Logic | Implementation | Best For |
|---|---|---|---|
| Static | Always use model X | AppConfig config value | Single-model applications |
| Content-based | Route based on query classification | Lambda classifier → model selection | Mixed complexity workloads |
| Complexity-based | Estimate query complexity, route accordingly | Token count + keyword signals | Cost optimization |
| Latency-based | Route to fastest available model meeting SLA | CloudWatch metrics → routing logic | Latency-SLA workloads |
| Cost-based | Route to cheapest model meeting quality threshold | A/B eval results → routing policy | Cost-optimized pipelines |
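The cost-based row can be made concrete with a small sketch: given offline A/B eval scores and per-token prices, pick the cheapest model that clears the quality bar. The model names, prices, and scores below are illustrative, not real pricing:

```python
# Hypothetical cost-based routing policy. Prices and eval scores are
# illustrative placeholders, not actual Bedrock pricing.
MODELS = [
    # (name, price per 1K output tokens in USD, offline A/B eval score)
    ("haiku",  0.00125, 0.78),
    ("sonnet", 0.015,   0.88),
    ("opus",   0.075,   0.95),
]

def cheapest_meeting_threshold(quality_threshold):
    """Return the cheapest model whose eval score meets the threshold."""
    eligible = [m for m in MODELS if m[2] >= quality_threshold]
    if not eligible:
        return MODELS[-1][0]  # fall back to the most capable model
    return min(eligible, key=lambda m: m[1])[0]
```

The routing policy itself is just data — you can store the table in AppConfig and refresh it whenever eval results change, without redeploying code.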
Content-based routing with Step Functions:
# Step Functions (Amazon States Language): route on FM-classified query complexity
# (the # comments are annotations for the reader, not valid JSON)
{
  "Classify Query Complexity": {
    "Type": "Task",
    "Resource": "arn:aws:lambda:::function:classify-query",
    # Fast classification using Haiku (cheap, fast)
    "Next": "Route by Complexity"
  },
  "Route by Complexity": {
    "Type": "Choice",
    "Choices": [
      {"Variable": "$.complexity", "StringEquals": "simple",
       "Next": "Invoke Haiku"},    # Simple Q&A, extraction
      {"Variable": "$.complexity", "StringEquals": "moderate",
       "Next": "Invoke Sonnet"},   # Analysis, summarization
      {"Variable": "$.complexity", "StringEquals": "complex",
       "Next": "Invoke Opus"}      # Multi-step reasoning, code
    ],
    "Default": "Invoke Sonnet"
  }
}
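The classify-query Task could also be backed by a zero-cost rule-based Lambda instead of an FM call. The following is a minimal sketch — the handler name follows the standard Lambda convention, but the keyword lists and token thresholds are illustrative assumptions you would tune for your workload:

```python
# Illustrative rule-based classifier; keyword lists and thresholds are
# assumptions, not a prescribed configuration.
COMPLEX_KEYWORDS = ("analyze", "strategy", "recommend", "step-by-step", "compare")
MODERATE_KEYWORDS = ("summarize", "explain", "rewrite")

def lambda_handler(event, context):
    """Classify query complexity with heuristics — no FM invocation needed."""
    query = event["query"]
    tokens = len(query.split())  # rough token estimate via whitespace split
    text = query.lower()
    if tokens > 200 or any(k in text for k in COMPLEX_KEYWORDS):
        complexity = "complex"
    elif tokens > 50 or any(k in text for k in MODERATE_KEYWORDS):
        complexity = "moderate"
    else:
        complexity = "simple"
    # Pass the original query through so the Choice state can route it
    return {"complexity": complexity, "query": query}
```

Because the output shape matches what the Choice state inspects (`$.complexity`), either implementation can sit behind the same Task state.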
API Gateway request transformation for routing: API Gateway can inspect request headers and route to different Lambda functions (which invoke different models) without application code changes:
# API Gateway routing rules via request parameters
x-model-tier: "premium" # → Invoke Opus
x-model-tier: "standard" # → Invoke Sonnet
x-model-tier: "economy" # → Invoke Haiku
# Default (no header) → Invoke Sonnet
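The tier-to-model mapping the backing Lambda would apply can be sketched as a plain lookup table. The model IDs below follow the Bedrock naming convention but should be treated as examples — verify current model IDs in your region:

```python
# Illustrative tier-to-model mapping; verify model IDs for your region.
TIER_TO_MODEL = {
    "premium":  "anthropic.claude-3-opus-20240229-v1:0",
    "standard": "anthropic.claude-3-sonnet-20240229-v1:0",
    "economy":  "anthropic.claude-3-haiku-20240307-v1:0",
}
DEFAULT_MODEL = TIER_TO_MODEL["standard"]

def resolve_model(headers):
    """Map the x-model-tier header to a model ID, defaulting to Sonnet."""
    tier = headers.get("x-model-tier", "standard").lower()
    return TIER_TO_MODEL.get(tier, DEFAULT_MODEL)
```

Because the tier is carried in a header, callers can be moved between tiers (or defaulted) without any change to application code — only the routing configuration changes.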
Model cascade pattern — attempt cheapest model first, escalate on low confidence:
def cascade_invoke(query, quality_threshold=0.8):
    # Start with the cheapest model
    result = invoke_haiku(query)
    confidence = score_confidence(result)
    if confidence >= quality_threshold:
        return result, 'haiku'
    # Escalate to Sonnet
    result = invoke_sonnet(query)
    confidence = score_confidence(result)
    if confidence >= quality_threshold:
        return result, 'sonnet'
    # Final escalation to Opus
    return invoke_opus(query), 'opus'
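The cascade leaves `score_confidence` abstract. One zero-cost approach is a lexical heuristic that penalizes hedging language and truncated answers — a minimal sketch, where the hedge phrases and penalty weights are assumptions you would calibrate against labeled data:

```python
# Crude confidence heuristic; phrase list and weights are illustrative.
HEDGES = ("i'm not sure", "i cannot", "it depends", "unclear", "i don't know")

def score_confidence(result):
    """Score a model response in [0, 1]: start at 1.0, penalize hedging
    phrases, very short answers, and apparent truncation."""
    text = result.lower()
    score = 1.0
    score -= 0.3 * sum(1 for h in HEDGES if h in text)
    if len(text.split()) < 5:                         # very short answers
        score -= 0.2
    if not text.rstrip().endswith((".", "!", "?")):   # likely truncated
        score -= 0.2
    return max(score, 0.0)
```

Production systems often combine such lexical signals with token log-probabilities or a small judge model; the key constraint is that scoring must cost far less than the escalation it prevents.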
⚠️ Exam Trap: The classification step in content-based routing itself costs tokens. If your classifier uses Claude 3 Haiku to classify a query before routing it to Claude 3 Haiku (because it's simple), you've doubled the cost with no quality benefit. Keep classifiers lightweight — rule-based heuristics (token count, keyword presence, query length) are often sufficient and cost zero additional FM invocations.
Reflection Question: Your application processes three types of requests: (1) "Is product X in stock?" — factual lookup that requires a database tool call, (2) "Summarize this 50-page report" — content processing, (3) "Analyze our competitive position and recommend strategy" — complex multi-step reasoning. Design a routing policy that minimizes cost while maintaining quality for each type.