2.2.3. Resilient FM Systems and Graceful Degradation
💡 First Principle: Foundation model APIs fail — they throttle under load, have regional outages, return malformed responses, and time out on long-context requests. A production FM system must treat the model API as an unreliable external dependency and build resilience at every layer, not just at the API call level.
Failure modes and mitigations:
| Failure Mode | Symptom | Mitigation |
|---|---|---|
| Throttling (429) | Request rejected at rate limit | Exponential backoff with jitter; SQS buffer for asynchronous workloads |
| Timeout | No response within SLA | Retry with shorter context; fall back to cached response |
| Model unavailability | Service disruption in region | Cross-region inference; fall back to alternative model |
| Malformed response | JSON parse error on FM output | Structured output enforcement; retry with format instructions |
| Context overflow | Input exceeds context window | Pre-truncate input; summarize conversation history |
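The first mitigation in the table, exponential backoff with jitter, can be sketched in a few lines of Python. This is a minimal illustration rather than a production client: `ThrottledError`, `backoff_delay`, and `call_with_backoff` are hypothetical names, and the base delay, cap, and attempt count are assumed values to tune against your own quotas.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 / throttling error from the model API."""


def backoff_delay(attempt, base=0.5, cap=20.0, rng=random.random):
    """Full-jitter delay: a random value in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_backoff(call, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Run call(); on ThrottledError wait with jittered backoff, then retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(backoff_delay(attempt, rng=rng))
```

The jitter matters because many clients throttled at the same moment would otherwise retry at the same moment. Randomizing each delay spreads the retries out. `sleep` and `rng` are parameters only so the behavior can be exercised without real waiting.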
The circuit breaker pattern with Step Functions:
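As a rough sketch of the pattern's state logic, here is an in-process breaker in plain Python. It is not a Step Functions definition, and `CircuitBreaker`, `invoke`, the failure threshold, and the cool-down period are all hypothetical choices made for illustration. In a Step Functions design, the equivalent state would typically live in an external store shared by all executions.

```python
import time


class CircuitBreaker:
    """Hypothetical in-process breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        """Return True if a call to the primary model may be attempted."""
        if self.opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let trial requests through.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        # A success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        # Reaching the threshold opens (or re-opens) the circuit.
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


def invoke(breaker, primary, fallback, prompt):
    """Call primary unless the breaker is open; otherwise use fallback."""
    if not breaker.allow():
        return fallback(prompt)
    try:
        result = primary(prompt)
    except Exception:
        breaker.record_failure()
        return fallback(prompt)
    breaker.record_success()
    return result
```

While the circuit is open, requests skip the primary model entirely. That is the point of the pattern: a failing dependency is not hammered with traffic it cannot serve, and callers get the fallback response without waiting for another timeout.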
Cross-region fallback implementation: When the model is unavailable or at capacity in the primary Region, a cross-region inference profile automatically routes the request to another Region included in the profile. For custom failover logic, such as switching to a different model or provider entirely, Step Functions orchestrates the retry and fallback sequence:
{
  "Comment": "Falls back to a secondary model when the primary model invocation fails",
  "StartAt": "Try primary model",
  "States": {
    "Try primary model": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:::function:invoke-primary-model",
      "Catch": [{
        "ErrorEquals": ["ThrottlingException", "ServiceUnavailableException"],
        "Next": "Invoke fallback model"
      }],
      "End": true
    },
    "Invoke fallback model": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:::function:invoke-fallback-model",
      "End": true
    }
  }
}
⚠️ Exam Trap: Cross-region inference for resilience and cross-region inference for model availability are the same feature used for different reasons. Exam questions often describe a scenario where a model is "only available in us-east-1" — the answer is cross-region inference profiles, not deploying a separate Bedrock stack in each region.
Reflection Question: Your FM application has an SLA of 99.9% availability. The primary model (Claude 3 Sonnet in us-east-1) had a 45-minute outage last quarter. What architectural components must you add, and what AWS services implement the automatic failover?
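A useful first step is to convert the SLA into a downtime budget. The sketch below does that arithmetic in Python. The measurement windows (30 and 90 days) are assumptions, since the question does not say how the SLA is measured, and `downtime_budget_minutes` is a hypothetical helper name.

```python
def downtime_budget_minutes(availability, window_days):
    """Minutes of downtime allowed by an availability target over a window."""
    return (1.0 - availability) * window_days * 24 * 60


monthly = downtime_budget_minutes(0.999, 30)    # roughly 43.2 minutes
quarterly = downtime_budget_minutes(0.999, 90)  # roughly 129.6 minutes
outage = 45  # minutes, from the scenario

# True means the single outage alone exceeds the budget for that window.
exceeds_monthly = outage > monthly
exceeds_quarterly = outage > quarterly
```

Under these assumptions, a single 45-minute outage breaks a 99.9% target measured monthly and consumes about a third of a quarterly budget. That is why failover has to be automatic rather than operator-driven.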