Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.2.1. Deployment Strategies: On-Demand, Provisioned, and SageMaker Endpoints

💡 First Principle: The three FM deployment modes — on-demand Bedrock, Bedrock provisioned throughput, and SageMaker endpoints — represent increasing levels of capacity commitment and infrastructure control, with corresponding changes in cost structure, latency characteristics, and operational overhead.

Deployment mode comparison:
| Mode | How It Works | Best For | Cost Model | Startup Latency |
|---|---|---|---|---|
| Bedrock On-Demand | Serverless; capacity managed by AWS | Variable, unpredictable traffic | Per token | ~100 ms |
| Bedrock Provisioned Throughput | Reserved model units; dedicated capacity | Sustained high volume; latency-sensitive SLAs | Per model unit/hour | <50 ms |
| SageMaker Endpoint | Your instance, your GPU; always on | Custom/fine-tuned models; full control | Per instance/hour | Warm: <100 ms; cold start: minutes |
| SageMaker Serverless | On-demand; scales to zero | Infrequent traffic; fine-tuned models | Per compute duration + data processed | Cold start: seconds to minutes |
Key mechanics of Bedrock Provisioned Throughput:
  • You purchase model units (MUs), each providing a fixed throughput of input and output tokens per minute
  • Capacity is reserved and always available; requests are not throttled under normal operation
  • Commitment options are no-commitment (billed hourly), 1 month, or 6 months; unused reserved capacity is still billed
  • Required for: guaranteed SLAs, the highest-throughput scenarios, and serving custom (fine-tuned) Bedrock models in production
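Provisioned throughput is purchased through the Bedrock control-plane API (`create_provisioned_model_throughput`). A minimal sketch of building that request, with the actual client call shown as a comment since it requires AWS credentials; the name and model ID are illustrative:

```python
def provisioned_throughput_request(name, model_id, model_units=1,
                                   commitment='OneMonth'):
    """Build kwargs for bedrock.create_provisioned_model_throughput.

    commitment may be 'OneMonth' or 'SixMonths'; passing None omits
    the field, requesting no-commitment (hourly) capacity where the
    model supports it.
    """
    kwargs = {
        'provisionedModelName': name,
        'modelId': model_id,
        'modelUnits': model_units,
    }
    if commitment:
        kwargs['commitmentDuration'] = commitment
    return kwargs

# bedrock = boto3.client('bedrock', region_name='us-east-1')  # control plane, not bedrock-runtime
# bedrock.create_provisioned_model_throughput(
#     **provisioned_throughput_request(
#         'prod-haiku', 'anthropic.claude-3-haiku-20240307-v1:0'))
```

Note the split: provisioning goes through the `bedrock` control-plane client, while inference against the resulting provisioned model ARN goes through `bedrock-runtime`.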

Lambda for on-demand Bedrock invocation: Lambda is the canonical compute layer for Bedrock on-demand calls. A Lambda function receives the request, constructs the Bedrock API call, handles retries, and returns the response. Timeout configuration is critical: long-context Bedrock generations can take 30–120 seconds, far beyond Lambda's 3-second default.

import boto3
import json

# Create the client outside the handler so warm invocations reuse it
bedrock_runtime = boto3.client('bedrock-runtime', region_name='us-east-1')

def lambda_handler(event, context):
    # Anthropic Messages API request body, as required by Bedrock
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": event['query']}]
    }
    response = bedrock_runtime.invoke_model(
        modelId='anthropic.claude-3-haiku-20240307-v1:0',
        body=json.dumps(body),
        contentType='application/json',
        accept='application/json'
    )
    # The response body is a streaming object; read and parse it once
    result = json.loads(response['body'].read())
    return result['content'][0]['text']
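On-demand Bedrock calls can be throttled under bursty traffic, so production handlers typically wrap `invoke_model` in retry logic. A minimal exponential-backoff sketch using only the standard library; the attempt count and delays are illustrative assumptions:

```python
import time
import random

def invoke_with_backoff(call, max_attempts=4, base_delay=0.5):
    """Retry a zero-argument callable with exponential backoff and jitter.

    `call` would wrap the Bedrock request, e.g.
    lambda: bedrock_runtime.invoke_model(...). Delay grows as
    base_delay * 2**attempt, plus random jitter so concurrent
    Lambdas don't retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # in practice, catch botocore ClientError for ThrottlingException
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

In real code you would catch `botocore.exceptions.ClientError` and retry only on throttling or transient errors, not on validation failures.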

⚠️ Exam Trap: AWS Lambda's default timeout is 3 seconds and its maximum is 15 minutes. A function left at the default will time out on almost any Bedrock call, and calls exceeding 15 minutes are terminated regardless of configuration. For very long-running generation, use Step Functions to orchestrate an asynchronous Lambda + SQS pattern, or SageMaker batch transform for non-real-time workloads.

Reflection Question: Your GenAI application serves 50,000 requests per day with a consistent pattern of 80% of traffic between 9am–5pm EST and near-zero traffic overnight. Should you use Bedrock on-demand or provisioned throughput? What calculation would you perform to determine the break-even point?
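One way to approach the reflection question: compare monthly on-demand token spend with the always-on monthly cost of provisioned throughput, remembering that provisioned capacity bills 24/7 even though traffic is concentrated in business hours. All prices and token counts below are placeholder assumptions, not current AWS list prices:

```python
def monthly_costs(requests_per_day, avg_tokens_per_request,
                  price_per_1k_tokens, mu_price_per_hour, days=30):
    """Compare on-demand vs provisioned-throughput monthly cost.

    All inputs are assumptions for illustration. Returns
    (on_demand_monthly, provisioned_monthly, break_even_requests_per_day).
    """
    # On-demand: pay only for tokens actually processed
    on_demand = (requests_per_day * avg_tokens_per_request / 1_000
                 * price_per_1k_tokens * days)
    # Provisioned: one model unit billed every hour of the month
    provisioned = mu_price_per_hour * 24 * days
    # Daily volume at which the two cost models intersect
    break_even = provisioned / (avg_tokens_per_request / 1_000
                                * price_per_1k_tokens * days)
    return on_demand, provisioned, break_even

# Assumed: 2,000 blended tokens/request, $0.002 per 1K tokens, $40/MU-hour
od, pt, be = monthly_costs(50_000, 2_000, 0.002, 40.0)
print(od)  # 6000.0  -> on-demand: $6,000/month
print(pt)  # 28800.0 -> provisioned: $28,800/month
print(be)  # 240000.0 -> break-even near 240,000 requests/day
```

Under these assumed prices, on-demand wins by a wide margin at 50,000 requests/day; the bursty 9am–5pm pattern makes the gap worse for provisioned throughput, since reserved capacity sits idle overnight.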

Written by Alvin Varughese, Founder (15 professional certifications)