4.3.3. Resilience: Retry, Backoff, and Fallback
💡 First Principle: Bedrock throttling is not a bug — it's a rate limit protecting shared infrastructure. Your application's response to throttling determines whether a capacity constraint causes a brief slowdown or a complete service outage. Exponential backoff with jitter is the minimum; a fallback model is the production standard.
Throttling response hierarchy:
| Level | Response | When |
|---|---|---|
| Immediate retry | 0ms delay — only for transient errors (503) | Service temporarily unavailable |
| Exponential backoff | Delay = min(base × 2^attempt + jitter, max_delay) | Throttling (429), capacity (503) |
| Fallback model | Switch to alternative model (different tier or region) | Extended throttling, SLA breach |
| Circuit break | Stop sending requests, serve cached/degraded response | Sustained outage |
| Queue buffer | Accept and queue requests, process when capacity available | Batch/async workloads |
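The capped exponential delay formula from the table can be made concrete with a short sketch (illustrative only; jitter is omitted here so the deterministic upper bounds are visible):

```python
def backoff_delays(base=1.0, max_delay=60.0, attempts=7):
    """Upper-bound delay per attempt: min(base * 2^attempt, max_delay)."""
    return [min(base * (2 ** attempt), max_delay) for attempt in range(attempts)]

print(backoff_delays())
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0] — the 7th attempt is capped at 60s
```

Note how the cap takes over at attempt 6 (2^6 = 64 > 60): without it, a long retry chain would quickly reach multi-minute sleeps.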
Exponential backoff with jitter implementation:
```python
import json
import random
import time

import boto3
from botocore.exceptions import ClientError

bedrock_runtime = boto3.client('bedrock-runtime')

def invoke_with_retry(payload, model_id, max_attempts=5):
    base_delay = 1.0   # seconds
    max_delay = 60.0   # seconds cap
    for attempt in range(max_attempts):
        try:
            return bedrock_runtime.invoke_model(
                modelId=model_id,
                body=json.dumps(payload)
            )
        except ClientError as e:
            error_code = e.response['Error']['Code']
            if error_code in ('ThrottlingException', 'ServiceUnavailableException'):
                if attempt == max_attempts - 1:
                    raise  # Final attempt — propagate error
                # Exponential backoff with full jitter
                delay = min(base_delay * (2 ** attempt), max_delay)
                jitter = random.uniform(0, delay)  # Full jitter prevents thundering herd
                time.sleep(jitter)
            elif error_code == 'ValidationException':
                raise  # Don't retry validation errors — they won't self-heal
            else:
                raise
```
Fallback model pattern with AWS SDK:
```python
import boto3

cloudwatch = boto3.client('cloudwatch')

PRIMARY_MODEL = 'anthropic.claude-3-sonnet-20240229-v1:0'
FALLBACK_MODEL = 'anthropic.claude-3-haiku-20240307-v1:0'

def invoke_with_fallback(payload):
    try:
        return invoke_with_retry(payload, PRIMARY_MODEL)
    except ClientError as e:
        if e.response['Error']['Code'] in ('ThrottlingException', 'ModelStreamErrorException'):
            # Log the fallback event for monitoring
            cloudwatch.put_metric_data(
                Namespace='GenAI/Application',
                MetricData=[{'MetricName': 'FallbackModelInvocations',
                             'Value': 1, 'Unit': 'Count'}]
            )
            return invoke_with_retry(payload, FALLBACK_MODEL)
        raise  # Non-throttling errors propagate unchanged
```
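The "queue buffer" level from the hierarchy table can be sketched for batch/async workloads. This is a minimal illustration using a bounded in-process queue; `process` stands in for a Bedrock invocation, and in production you would typically use SQS rather than an in-memory queue:

```python
import queue
import threading

def run_buffered(requests, process, maxsize=100):
    buf = queue.Queue(maxsize=maxsize)  # bounded: applies backpressure instead of growing unbounded
    results = []

    def worker():
        while True:
            item = buf.get()
            if item is None:   # sentinel: no more work
                break
            results.append(process(item))
            buf.task_done()

    t = threading.Thread(target=worker)
    t.start()
    for r in requests:
        buf.put(r)   # blocks when the queue is full — producers slow down
    buf.put(None)
    t.join()
    return results
```

The bounded `maxsize` is the key design choice: it converts a capacity shortfall into backpressure on producers rather than unbounded memory growth.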
X-Ray tracing across service boundaries:
```python
from aws_xray_sdk.core import xray_recorder, patch_all

patch_all()  # Auto-instrument all boto3 calls

@xray_recorder.capture('invoke_bedrock')
def invoke_bedrock(payload, model_id):
    subsegment = xray_recorder.current_subsegment()
    subsegment.put_annotation('model_id', model_id)  # Annotations are indexed and searchable
    subsegment.put_metadata('token_count', estimate_tokens(payload))  # Metadata is not indexed
    return bedrock_runtime.invoke_model(modelId=model_id, body=json.dumps(payload))
```
⚠️ Exam Trap: Exponential backoff guarantees eventual processing only when the throttling is temporary. Under sustained high load, backoff without a maximum retry limit will queue requests indefinitely, exhausting Lambda concurrency and causing cascading failures. Always pair backoff with a maximum attempt count and a circuit breaker.
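A circuit breaker, as the trap above recommends, can be sketched in a few lines. This is an illustrative minimal implementation (class and parameter names are hypothetical, not from any AWS SDK): open after a run of consecutive failures, reject calls while open, and allow a trial call after a cooldown.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial call once the cooldown has elapsed
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # Close the circuit again

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # Open: stop sending requests
```

When `allow_request()` returns False, the caller serves the cached or degraded response from the hierarchy table instead of invoking Bedrock, which is what breaks the cascading-failure loop.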
Reflection Question: At 9am Monday, your application's Bedrock invocations spike 10x as the business day starts. You observe 429 ThrottlingExceptions. Your retry logic retries with exponential backoff up to 5 times. After 5 attempts, the request fails. What three architectural changes would prevent this failure pattern?