6.1.3. Provisioned Throughput and Batch Optimization
💡 First Principle: Provisioned throughput transforms FM pricing from a variable cost (pay per token) to a fixed cost (pay per reserved model unit per hour). This only saves money when utilization is high enough that the fixed cost is less than what you'd pay on-demand — typically above 60–70% sustained utilization.
Provisioned throughput break-even calculation:
# Break-even analysis: provisioned vs on-demand
def calculate_breakeven(
    model_units: int,               # Number of model units purchased
    cost_per_mu_per_hour: float,    # e.g., $5.00 per MU/hour for Claude Haiku
    on_demand_input_price: float,   # e.g., $0.00025 per 1K tokens
    on_demand_output_price: float,  # e.g., $0.00125 per 1K tokens
    avg_input_tokens: int,
    avg_output_tokens: int,
    tokens_per_mu_per_minute: int,  # Capacity of one model unit
):
    max_tokens_per_hour = model_units * tokens_per_mu_per_minute * 60
    # On-demand cost at 100% utilization of provisioned capacity
    input_share = avg_input_tokens / (avg_input_tokens + avg_output_tokens)
    max_on_demand_cost = (
        max_tokens_per_hour * input_share / 1000 * on_demand_input_price
        + max_tokens_per_hour * (1 - input_share) / 1000 * on_demand_output_price
    )
    provisioned_hourly_cost = model_units * cost_per_mu_per_hour
    breakeven_utilization = provisioned_hourly_cost / max_on_demand_cost
    print(f"Break-even utilization: {breakeven_utilization:.1%}")
    print(f"Provisioned cost at 100% util: ${provisioned_hourly_cost:.2f}/hr")
    print(f"On-demand cost at 100% util: ${max_on_demand_cost:.2f}/hr")
    return breakeven_utilization
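Plugging in illustrative numbers shows how the break-even point falls out. All figures below are hypothetical (MU capacity and pricing vary by model and region — check the current Bedrock pricing page for real values):

```python
# All figures below are illustrative, not current AWS prices.
mu_capacity_per_min = 200_000            # tokens/minute for one model unit (assumed)
cost_per_mu_hour = 5.00                  # $/MU-hour (assumed)
in_price, out_price = 0.00025, 0.00125   # $ per 1K tokens, on-demand
in_tok, out_tok = 1_000, 500             # average tokens per request

max_tokens_hr = mu_capacity_per_min * 60                  # 12,000,000 tokens/hour
input_share = in_tok / (in_tok + out_tok)                 # 2/3 of tokens are input
on_demand_hr = (max_tokens_hr * input_share / 1000 * in_price
                + max_tokens_hr * (1 - input_share) / 1000 * out_price)
breakeven = cost_per_mu_hour / on_demand_hr
print(f"On-demand at full capacity: ${on_demand_hr:.2f}/hr")   # $7.00/hr
print(f"Break-even utilization: {breakeven:.1%}")              # 71.4%
```

In this scenario, provisioned throughput only pays off above roughly 71% sustained utilization; below that, on-demand is cheaper because the fixed MU cost is incurred whether or not the capacity is used.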
Bedrock batch inference for cost optimization: For non-real-time workloads (nightly processing, document classification, bulk summarization), Bedrock Batch Inference processes large job queues at reduced pricing:
# Submit a batch inference job — processes items from S3 JSONL input
import boto3

bedrock = boto3.client('bedrock')  # control-plane client, not 'bedrock-runtime'

response = bedrock.create_model_invocation_job(
    jobName='nightly-document-summary-batch',
    modelId='anthropic.claude-3-haiku-20240307-v1:0',
    inputDataConfig={
        's3InputDataConfig': {
            's3Uri': 's3://my-bucket/batch-input/documents.jsonl',
            's3InputFormat': 'JSONL'
        }
    },
    outputDataConfig={
        's3OutputDataConfig': {
            's3Uri': 's3://my-bucket/batch-output/',
            's3EncryptionKeyId': 'arn:aws:kms:...:key/KEY-ID'
        }
    },
    roleArn='arn:aws:iam::123456789:role/BedrockBatchRole'
)
# Batch jobs process at ~50% discount vs on-demand — but take hours, not seconds
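Because completion time is non-deterministic, callers typically poll the job's status rather than wait inline. A minimal polling sketch — `get_model_invocation_job` is the real control-plane call, but the helper name, poll interval, and injected client parameter are our own conventions:

```python
import time

# Terminal states reported by the GetModelInvocationJob API;
# anything else ('Submitted', 'InProgress', etc.) means the job is still running.
TERMINAL_STATES = {'Completed', 'PartiallyCompleted', 'Failed', 'Stopped', 'Expired'}

def wait_for_batch_job(client, job_arn, poll_seconds=60):
    """Poll a Bedrock batch job until it reaches a terminal state, then return it."""
    while True:
        status = client.get_model_invocation_job(jobIdentifier=job_arn)['status']
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
```

In practice you would call `wait_for_batch_job(boto3.client('bedrock'), response['jobArn'])` using the ARN returned by `create_model_invocation_job`; injecting the client also makes the helper easy to unit-test with a stub.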
Input JSONL format for batch inference:
{"recordId": "doc-001", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512, "messages": [{"role": "user", "content": "Summarize: [document text]"}]}}
{"recordId": "doc-002", "modelInput": {"anthropic_version": "bedrock-2023-05-31", "max_tokens": 512, "messages": [{"role": "user", "content": "Summarize: [document text]"}]}}
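These records are usually generated programmatically. A sketch (the helper name is ours) that writes the JSONL input file the batch job reads from S3:

```python
import json

def build_batch_records(documents, max_tokens=512):
    """Yield one JSON line per (record_id, text) pair in the format shown above."""
    for record_id, text in documents:
        yield json.dumps({
            "recordId": record_id,
            "modelInput": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [{"role": "user", "content": f"Summarize: {text}"}],
            },
        })

docs = [("doc-001", "First document text..."), ("doc-002", "Second document text...")]
with open("documents.jsonl", "w") as f:
    f.write("\n".join(build_batch_records(docs)) + "\n")
```

Upload the resulting file to the S3 URI referenced in `inputDataConfig`; each `recordId` reappears in the output so results can be joined back to their source documents.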
⚠️ Exam Trap: Bedrock Batch Inference enforces a minimum number of records per job and is designed for large offline workloads — completion time is non-deterministic and may take hours. Exam scenarios requiring sub-second or sub-minute response SLAs cannot use Batch Inference.
Reflection Question: Your nightly document processing pipeline currently invokes Bedrock synchronously for 50,000 documents between midnight and 6am. Costs are high and the pipeline often finishes late, delaying morning reports. What two changes would you make to reduce cost and improve reliability, and what is the architectural pattern for each?