Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

4.3.4. Intelligent Model Routing

💡 First Principle: Not every query deserves the most capable (and most expensive) model. Intelligent routing directs simple queries to cheap, fast models and complex queries to capable models. Depending on the traffic mix and model prices, this can cut costs by 60–90% without degrading quality for the majority of requests that are inherently simple.
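To see where a saving in that range can come from, here is a back-of-the-envelope sketch. The relative per-request costs and the traffic mix are invented for illustration; they are not real pricing, and the function names are hypothetical.

```python
# Hypothetical relative cost per request for each tier (NOT real pricing).
COST_PER_REQUEST = {"economy": 1.0, "standard": 12.0, "premium": 60.0}


def blended_cost(traffic_mix):
    """Average cost per request for a mix such as {"economy": 0.7, ...}."""
    if abs(sum(traffic_mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(COST_PER_REQUEST[tier] * share for tier, share in traffic_mix.items())


def savings_vs_single_model(traffic_mix, baseline_tier="premium"):
    """Fraction saved compared with sending every request to baseline_tier."""
    baseline = COST_PER_REQUEST[baseline_tier]
    return 1.0 - blended_cost(traffic_mix) / baseline


# Assumed mix: 70% simple, 25% moderate, 5% complex requests.
mix = {"economy": 0.70, "standard": 0.25, "premium": 0.05}
print(f"blended cost per request: {blended_cost(mix):.2f}")
print(f"savings vs all-premium:   {savings_vs_single_model(mix):.0%}")
```

With these assumed numbers the blended cost is 6.7 units against 60 for an all-premium setup, a saving of roughly 89%. A mix with fewer simple requests, or a smaller price gap between tiers, would save less.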

Routing strategies and their implementations:
| Strategy | Logic | Implementation | Best For |
|---|---|---|---|
| Static | Always use model X | AppConfig config value | Single-model applications |
| Content-based | Route based on query classification | Lambda classifier → model selection | Mixed complexity workloads |
| Complexity-based | Estimate query complexity, route accordingly | Token count + keyword signals | Cost optimization |
| Latency-based | Route to fastest available model meeting SLA | CloudWatch metrics → routing logic | Latency-SLA workloads |
| Cost-based | Route to cheapest model meeting quality threshold | A/B eval results → routing policy | Cost-optimized pipelines |

Content-based routing with Step Functions:
{
    "Comment": "Route based on FM-classified query complexity. The Invoke Haiku, Invoke Sonnet, and Invoke Opus states are omitted from this excerpt.",
    "StartAt": "Classify Query Complexity",
    "States": {
        "Classify Query Complexity": {
            "Type": "Task",
            "Comment": "Fast classification using Haiku (cheap, fast)",
            "Resource": "arn:aws:lambda:::function:classify-query",
            "Next": "Route by Complexity"
        },
        "Route by Complexity": {
            "Type": "Choice",
            "Comment": "simple: Q&A, extraction. moderate: analysis, summarization. complex: multi-step reasoning, code.",
            "Choices": [
                {"Variable": "$.complexity", "StringEquals": "simple", "Next": "Invoke Haiku"},
                {"Variable": "$.complexity", "StringEquals": "moderate", "Next": "Invoke Sonnet"},
                {"Variable": "$.complexity", "StringEquals": "complex", "Next": "Invoke Opus"}
            ],
            "Default": "Invoke Sonnet"
        }
    }
}
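The Choice state compares `$.complexity` with `StringEquals`, which is an exact, case-sensitive match, so the classifier Lambda has to emit one of the three labels verbatim. The sketch below shows one way the `classify-query` function might normalize a model's free-text answer before returning it. The `query` event key, the `classify` callable, and the fallback to `moderate` (chosen to mirror the state machine's `Default`) are assumptions for illustration, not part of the original design.

```python
VALID_LABELS = ("simple", "moderate", "complex")
DEFAULT_LABEL = "moderate"  # assumption: mirrors the Choice state's Default (Sonnet)


def normalize_label(raw):
    """Map free-text classifier output onto one of the exact labels."""
    text = (raw or "").strip().lower()
    for label in VALID_LABELS:
        if text.startswith(label):
            return label
    return DEFAULT_LABEL


def handler(event, context=None, classify=None):
    """Lambda-style entry point.

    `classify` is a hypothetical callable that sends the query to a small
    model and returns its raw text answer; a stub is used when none is given.
    """
    classify = classify or (lambda query: DEFAULT_LABEL)
    raw = classify(event["query"])
    # Pass the input through and add the field the Choice state reads.
    return {**event, "complexity": normalize_label(raw)}


# Usage with a stubbed classifier that answers in a chatty format.
print(handler({"query": "Plan a migration"}, classify=lambda q: "Complex. Needs planning."))
```

Normalizing inside the Lambda keeps the state machine definition simple: any unexpected answer still lands on a label the Choice state can match.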

API Gateway request transformation for routing: API Gateway can inspect request headers and route to different Lambda functions (which invoke different models) without application code changes:

# API Gateway routing rules via request parameters
x-model-tier: "premium"  # → Invoke Opus
x-model-tier: "standard" # → Invoke Sonnet  
x-model-tier: "economy"  # → Invoke Haiku
# Default (no header) → Invoke Sonnet
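If the header is resolved inside a single Lambda rather than by separate integrations, the mapping above might look like the following sketch. The function and constant names are hypothetical, the model names are placeholders rather than real model IDs, and falling back to the default tier for an unrecognized value is an assumption (rejecting the request would be an equally valid choice).

```python
# Tier-to-model mapping taken from the header rules above (placeholder names).
TIER_TO_MODEL = {"premium": "opus", "standard": "sonnet", "economy": "haiku"}
DEFAULT_TIER = "standard"


def resolve_model(headers):
    """Pick a model from the x-model-tier request header."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    normalized = {key.lower(): value for key, value in (headers or {}).items()}
    tier = str(normalized.get("x-model-tier", DEFAULT_TIER)).strip().lower()
    # Assumption: unknown tiers fall back to the default instead of failing.
    return TIER_TO_MODEL.get(tier, TIER_TO_MODEL[DEFAULT_TIER])


print(resolve_model({"X-Model-Tier": "premium"}))
print(resolve_model({}))
```

Normalizing the header name matters in practice because clients and proxies do not agree on capitalization.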

Model cascade pattern — attempt cheapest model first, escalate on low confidence:

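def cascade_invoke(query, quality_threshold=0.8):
    """Return (result, model_name), escalating while confidence is too low.

    invoke_haiku, invoke_sonnet, invoke_opus, and score_confidence are
    assumed to be defined elsewhere; score_confidence must return a value
    on the same scale as quality_threshold.
    """
    # Start with the cheapest model
    result = invoke_haiku(query)
    if score_confidence(result) >= quality_threshold:
        return result, 'haiku'

    # Escalate to Sonnet
    result = invoke_sonnet(query)
    if score_confidence(result) >= quality_threshold:
        return result, 'sonnet'

    # Final escalation to Opus: its answer is returned without scoring
    return invoke_opus(query), 'opus'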
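The same cascade can be written generically over an ordered list of tiers, which makes it easy to try with stubs. This is a minimal sketch, not a production implementation: the stub models and the length-based `toy_score` are made up purely so the example runs without any model calls.

```python
def cascade(query, tiers, score, threshold=0.8):
    """Try tiers cheapest-first; return (result, tier_name, attempted_names)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    attempted = []
    for index, (name, invoke) in enumerate(tiers):
        result = invoke(query)
        attempted.append(name)
        is_last = index == len(tiers) - 1
        # Accept the answer if it clears the threshold or no tier remains.
        if is_last or score(result) >= threshold:
            return result, name, attempted


# Stub models and a toy confidence score (illustrative assumptions only).
tiers = [
    ("haiku", lambda q: "short"),
    ("sonnet", lambda q: "a longer answer"),
    ("opus", lambda q: "the most detailed answer of all"),
]


def toy_score(text):
    return min(len(text) / 20, 1.0)


result, used, attempted = cascade("any query", tiers, toy_score)
print(used, attempted)
```

The `attempted` list is worth logging: every escalation means the cheaper attempts were paid for as well, so a cascade only saves money when most queries are accepted at the first tier.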

⚠️ Exam Trap: The classification step in content-based routing itself costs tokens. If your classifier uses Claude 3 Haiku to classify a query before routing it to Claude 3 Haiku (because it's simple), you've doubled the cost with no quality benefit. Keep classifiers lightweight — rule-based heuristics (token count, keyword presence, query length) are often sufficient and cost zero additional FM invocations.
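A rule-based classifier of the kind described here could be as small as the sketch below. The keyword lists and the length cutoff are illustrative assumptions that would need tuning against real traffic, and word count stands in as a rough proxy for token count.

```python
import re

# Illustrative keyword signals (assumptions, not a vetted list).
COMPLEX_KEYWORDS = {"analyze", "recommend", "strategy", "compare", "design", "debug"}
MODERATE_KEYWORDS = {"summarize", "explain", "rewrite", "translate"}


def classify_query(query, long_query_words=200):
    """Return 'simple', 'moderate', or 'complex' without calling any model."""
    words = re.findall(r"[a-z']+", query.lower())
    vocabulary = set(words)
    if vocabulary & COMPLEX_KEYWORDS:
        return "complex"
    # Long inputs are treated as at least moderate, even without a keyword.
    if vocabulary & MODERATE_KEYWORDS or len(words) > long_query_words:
        return "moderate"
    return "simple"


for q in (
    "Is product X in stock?",
    "Summarize this 50-page report",
    "Analyze our competitive position and recommend strategy",
):
    print(classify_query(q), "<-", q)
```

Heuristics like these will misroute some queries, so they pair naturally with a safe default tier or with the cascade pattern above as a backstop.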

Reflection Question: Your application processes three types of requests: (1) "Is product X in stock?" — factual lookup that requires a database tool call, (2) "Summarize this 50-page report" — content processing, (3) "Analyze our competitive position and recommend strategy" — complex multi-step reasoning. Design a routing policy that minimizes cost while maintaining quality for each type.

Written by Alvin Varughese
Founder, 15 professional certifications