Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

6.1.3. Provisioned Throughput and Batch Optimization

💡 First Principle: Provisioned throughput transforms FM pricing from a variable cost (pay per token) to a fixed cost (pay per reserved model unit per hour). This only saves money when utilization is high enough that the fixed cost is less than what you'd pay on-demand — typically above 60–70% sustained utilization.

Provisioned throughput break-even calculation:
# Break-even analysis: provisioned vs on-demand
def calculate_breakeven(
    model_units: int,              # number of model units (MUs) purchased
    cost_per_mu_per_hour: float,   # e.g., $5.00 per MU/hour for Claude Haiku
    on_demand_input_price: float,  # e.g., $0.00025 per 1K input tokens
    on_demand_output_price: float, # e.g., $0.00125 per 1K output tokens
    avg_input_tokens: int,
    avg_output_tokens: int,
    tokens_per_mu_per_minute: int, # throughput capacity of one model unit
) -> float:
    max_tokens_per_hour = model_units * tokens_per_mu_per_minute * 60

    # Split hourly token capacity into input and output shares
    input_share = avg_input_tokens / (avg_input_tokens + avg_output_tokens)
    output_share = avg_output_tokens / (avg_input_tokens + avg_output_tokens)

    # What on-demand would charge if you pushed 100% of the provisioned
    # capacity through it
    max_on_demand_cost = (
        max_tokens_per_hour * input_share / 1000 * on_demand_input_price
        + max_tokens_per_hour * output_share / 1000 * on_demand_output_price
    )

    provisioned_hourly_cost = model_units * cost_per_mu_per_hour

    # Utilization above which provisioned is cheaper than on-demand
    breakeven_utilization = provisioned_hourly_cost / max_on_demand_cost
    print(f"Break-even utilization: {breakeven_utilization:.1%}")
    print(f"Provisioned cost at 100% util: ${provisioned_hourly_cost:.2f}/hr")
    print(f"On-demand cost at 100% util: ${max_on_demand_cost:.2f}/hr")
    return breakeven_utilization
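Plugging illustrative numbers into the same formula makes the break-even threshold concrete. The per-MU capacity of 300,000 tokens/minute below is a placeholder for the sake of arithmetic, not an actual Bedrock figure; real MU capacities vary by model and must be taken from your quota:

```python
# Worked example with illustrative numbers (per-MU capacity is assumed)
model_units = 1
cost_per_mu_per_hour = 5.00       # $/hr per MU
on_demand_input_price = 0.00025   # $ per 1K input tokens
on_demand_output_price = 0.00125  # $ per 1K output tokens
avg_input_tokens, avg_output_tokens = 800, 200
tokens_per_mu_per_minute = 300_000  # assumed, not a real Bedrock figure

max_tokens_per_hour = model_units * tokens_per_mu_per_minute * 60  # 18M
input_share = avg_input_tokens / (avg_input_tokens + avg_output_tokens)   # 0.8
output_share = avg_output_tokens / (avg_input_tokens + avg_output_tokens) # 0.2

max_on_demand_cost = (
    max_tokens_per_hour * input_share / 1000 * on_demand_input_price    # $3.60
    + max_tokens_per_hour * output_share / 1000 * on_demand_output_price  # $4.50
)  # $8.10/hr at full capacity

breakeven = (model_units * cost_per_mu_per_hour) / max_on_demand_cost
print(f"{breakeven:.1%}")  # prints 61.7%
```

With these numbers, provisioned throughput only pays off if you sustain more than roughly 62% of the reserved capacity, which matches the 60–70% rule of thumb above.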

Bedrock batch inference for cost optimization: For non-real-time workloads (nightly processing, document classification, bulk summarization), Bedrock Batch Inference processes large job queues at reduced pricing:

# Submit a batch inference job: processes records from an S3 JSONL input
import boto3

bedrock = boto3.client('bedrock')  # batch jobs use the control-plane client, not bedrock-runtime
response = bedrock.create_model_invocation_job(
    jobName='nightly-document-summary-batch',
    modelId='anthropic.claude-3-haiku-20240307-v1:0',
    inputDataConfig={
        's3InputDataConfig': {
            's3Uri': 's3://my-bucket/batch-input/documents.jsonl',
            's3InputFormat': 'JSONL'
        }
    },
    outputDataConfig={
        's3OutputDataConfig': {
            's3Uri': 's3://my-bucket/batch-output/',
            's3EncryptionKeyId': 'arn:aws:kms:...:key/KEY-ID'
        }
    },
    roleArn='arn:aws:iam::123456789:role/BedrockBatchRole'
)
# Batch jobs process at roughly a 50% discount vs on-demand, but take hours, not seconds

Input JSONL format for batch inference:
{"recordId": "doc-001", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512, "messages": [{"role": "user", "content": "Summarize: [document text]"}]}}
{"recordId": "doc-002", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512, "messages": [{"role": "user", "content": "Summarize: [document text]"}]}}
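Generating that input file is straightforward. The helper below is a hypothetical sketch (the `write_batch_input` name and the doc-NNN record IDs are illustrative choices, not part of any Bedrock API); the record structure matches the JSONL format shown above:

```python
import json

# Hypothetical helper: build the batch input JSONL from a list of documents.
# One JSON object per line; recordId lets you match outputs back to inputs.
def write_batch_input(documents, path="documents.jsonl"):
    with open(path, "w") as f:
        for i, text in enumerate(documents, start=1):
            record = {
                "recordId": f"doc-{i:03d}",
                "modelInput": {
                    "anthropic_version": "bedrock-2023-05-31",
                    "max_tokens": 512,
                    "messages": [
                        {"role": "user", "content": f"Summarize: {text}"}
                    ],
                },
            }
            f.write(json.dumps(record) + "\n")

write_batch_input(["First document text...", "Second document text..."])
```

Upload the resulting file to the S3 URI referenced in `inputDataConfig` before submitting the job.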

⚠️ Exam Trap: Bedrock Batch Inference has a minimum job size and cannot be used for real-time responses — it's designed for large offline workloads. Batch job completion time is non-deterministic and may take hours. Exam scenarios requiring sub-second or sub-minute response SLAs cannot use Batch Inference.
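Because completion time is non-deterministic, pipelines typically poll the job status rather than assume a finish time. A minimal sketch: `wait_for_batch_job` is a hypothetical helper, while `get_model_invocation_job` is the real boto3 `bedrock` client call; the exact set of terminal status values should be confirmed against the current API documentation:

```python
import time

# Hypothetical helper: block until a batch job reaches a terminal state.
def wait_for_batch_job(bedrock, job_arn, poll_seconds=300):
    terminal = {"Completed", "Failed", "Stopped", "PartiallyCompleted", "Expired"}
    while True:
        job = bedrock.get_model_invocation_job(jobIdentifier=job_arn)
        if job["status"] in terminal:
            return job["status"]
        time.sleep(poll_seconds)  # jobs take hours, so poll sparingly
```

In production you would more likely react to the job's EventBridge state-change events or a scheduled check than hold a long-running polling loop open.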

Reflection Question: Your nightly document processing pipeline currently invokes Bedrock synchronously for 50,000 documents between midnight and 6am. Costs are high and the pipeline often finishes late, delaying morning reports. What two changes would you make to reduce cost and improve reliability, and what is the architectural pattern for each?

Written by Alvin Varughese
Founder, 15 professional certifications