2.2.1. Deployment Strategies: On-Demand, Provisioned, and SageMaker Endpoints
💡 First Principle: The three FM deployment modes — on-demand Bedrock, Bedrock provisioned throughput, and SageMaker endpoints — represent increasing levels of capacity commitment and infrastructure control, with corresponding changes in cost structure, latency characteristics, and operational overhead.
Deployment mode comparison:
| Mode | How It Works | Best For | Cost Model | Startup Latency |
|---|---|---|---|---|
| Bedrock On-Demand | Serverless; capacity managed by AWS | Variable, unpredictable traffic | Per token | ~100ms |
| Bedrock Provisioned Throughput | Reserved model units; dedicated capacity | Sustained high-volume; latency-sensitive SLAs | Per model unit/hour | <50ms |
| SageMaker Endpoint | Your instance, your GPU; always-on | Custom/fine-tuned models; full control | Per instance/hour | Warm: <100ms; Cold start: minutes |
| SageMaker Serverless | On-demand; scales to zero | Infrequent traffic; fine-tuned models | Per inference duration + data processed | Cold start: seconds to minutes |
Bedrock Provisioned Throughput — the key mechanics:
- You purchase model units (MUs), each representing a fixed number of tokens per minute
- Capacity is reserved and always available — no throttling under normal operation
- Commitment options range from no-commitment (hourly) to 1-month and 6-month terms; committed capacity is billed whether used or not
- Required for: guaranteed SLAs, highest throughput scenarios, and deploying fine-tuned Bedrock models to production
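The bullets above imply a break-even calculation: provisioned throughput pays off once sustained on-demand token spend exceeds the hourly model-unit cost. A sketch of that comparison, where every dollar figure and throughput number is an illustrative placeholder, not a published AWS rate:

```python
def monthly_on_demand_cost(requests_per_day, in_tokens, out_tokens,
                           price_in_per_1k, price_out_per_1k, days=30):
    """On-demand: pay only for tokens actually processed."""
    per_request = (in_tokens / 1000) * price_in_per_1k \
                + (out_tokens / 1000) * price_out_per_1k
    return requests_per_day * per_request * days

def monthly_provisioned_cost(model_units, price_per_mu_hour, hours=24 * 30):
    """Provisioned: pay per MU-hour whether traffic arrives or not."""
    return model_units * price_per_mu_hour * hours

# Illustrative numbers only — substitute current pricing for your model:
od = monthly_on_demand_cost(50_000, in_tokens=500, out_tokens=300,
                            price_in_per_1k=0.00025, price_out_per_1k=0.00125)
pt = monthly_provisioned_cost(model_units=1, price_per_mu_hour=40.0)
print(f"on-demand ~ ${od:,.2f}/mo vs provisioned ~ ${pt:,.2f}/mo")
```

With bursty daytime-only traffic, the on-demand line scales with actual usage while the provisioned line is flat, which is why the comparison usually favors on-demand until utilization is consistently high.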
Lambda for on-demand Bedrock invocation: Lambda is the canonical compute layer for Bedrock on-demand calls. A Lambda function receives the request, constructs the Bedrock API call, handles retries, and returns the response. Timeout configuration is critical — Bedrock calls can take 30–120 seconds for long-context generation.
```python
import json
import boto3

bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

def lambda_handler(event, context):
    # Anthropic Messages API request body for Bedrock
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": event['query']}]
    }
    response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-3-haiku-20240307-v1:0',
        body=json.dumps(body),
        contentType='application/json',
        accept='application/json'
    )
    # Response body is a streaming object: read, parse, extract the text
    return json.loads(response['body'].read())['content'][0]['text']
```
⚠️ Exam Trap: AWS Lambda has a default timeout of 3 seconds and a maximum of 15 minutes. Long-context Bedrock calls that run past the configured timeout are killed by Lambda. For generation that cannot complete synchronously, decouple with an asynchronous pattern (Step Functions, or a Lambda that enqueues jobs to SQS for a separate worker), or use SageMaker batch transform for non-real-time workloads.
Reflection Question: Your GenAI application serves 50,000 requests per day with a consistent pattern of 80% of traffic between 9am–5pm EST and near-zero traffic overnight. Should you use Bedrock on-demand or provisioned throughput? What calculation would you perform to determine the break-even point?