2.1.1. Model Selection Criteria and Benchmarking
💡 First Principle: Model selection must be empirical, not intuitive — run your actual representative queries against candidate models, measure against your specific quality criteria, and let the data decide. Benchmarks from research papers measure academic tasks, not your business problem.
The evaluation framework for model selection:
| Criterion | What to Measure | How to Measure |
|---|---|---|
| Task quality | Does output meet minimum acceptable threshold? | Bedrock Model Evaluations, human evaluation on golden dataset |
| Latency | P50/P99 response time under your load pattern | CloudWatch, X-Ray tracing with load test |
| Context window | Maximum tokens for your longest expected inputs | Check model spec; test with actual long docs |
| Cost per query | (Input tokens × input price) + (output tokens × output price) | Calculate from Bedrock pricing page for candidate models |
| Modality support | Text only, or image/audio/document? | Model card on Bedrock console |
| Regional availability | Available in your required AWS region? | Bedrock console regional model list |
| Fine-tuning support | Can be customized via Bedrock fine-tuning? | Bedrock documentation — not all models support it |
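The cost-per-query row above is worth turning into arithmetic before committing to a model. A minimal sketch, assuming illustrative placeholder prices (always pull current rates from the Bedrock pricing page for your region; the model keys and price values here are not real quotes):

```python
# Back-of-envelope cost comparison for candidate models.
# Prices are (input_usd, output_usd) per 1K tokens -- ILLUSTRATIVE ONLY.
PRICE_PER_1K_TOKENS = {
    "small-model": (0.00025, 0.00125),
    "large-model": (0.003, 0.015),
}

def cost_per_query(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = input tokens x input price + output tokens x output price."""
    in_price, out_price = PRICE_PER_1K_TOKENS[model]
    return (input_tokens / 1000) * in_price + (output_tokens / 1000) * out_price

def monthly_cost(model: str, queries: int, input_tokens: int, output_tokens: int) -> float:
    """Scale per-query cost by monthly query volume."""
    return queries * cost_per_query(model, input_tokens, output_tokens)

# e.g. 5M queries/month, ~2K input tokens and ~100 output tokens each
for model in PRICE_PER_1K_TOKENS:
    print(model, round(monthly_cost(model, 5_000_000, 2_000, 100), 2))
```

At this volume, even a small per-query price difference compounds into thousands of dollars per month, which is exactly why the quality comparison (does the cheaper model clear your acceptance threshold?) must come first.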
Implementing model evaluation with Bedrock Model Evaluations: Bedrock provides a managed evaluation job that runs your prompt dataset against one or more models and scores the outputs. Automated jobs use built-in metrics such as accuracy, robustness, and toxicity; human-review jobs support custom criteria such as helpfulness and coherence. For head-to-head model comparison, this is the production-grade approach, not manual spot-checking.
```python
# Example: trigger a Bedrock model evaluation job comparing two candidates
import boto3

bedrock = boto3.client('bedrock', region_name='us-east-1')

response = bedrock.create_evaluation_job(
    jobName='model-selection-eval-q4',
    evaluationConfig={
        'automated': {
            'datasetMetricConfigs': [{
                'taskType': 'QuestionAndAnswer',
                # The dataset location is nested under 'datasetLocation'
                'dataset': {
                    'name': 'golden-qa-dataset',
                    'datasetLocation': {'s3Uri': 's3://my-bucket/eval/'}
                },
                # Automated jobs use built-in metric names
                'metricNames': ['Builtin.Accuracy', 'Builtin.Robustness', 'Builtin.Toxicity']
            }]
        }
    },
    inferenceConfig={
        'models': [
            {'bedrockModel': {'modelIdentifier': 'anthropic.claude-3-haiku-20240307-v1:0'}},
            {'bedrockModel': {'modelIdentifier': 'anthropic.claude-3-sonnet-20240229-v1:0'}}
        ]
    },
    outputDataConfig={'s3Uri': 's3://my-bucket/eval-results/'},
    roleArn='arn:aws:iam::123456789012:role/BedrockEvaluationRole'
)
```
⚠️ Exam Trap: Bedrock Model Evaluations requires an IAM role with specific permissions to read your evaluation dataset from S3 and write results back. A common failure mode in exam scenarios is missing IAM permissions on the evaluation job role — the job fails with an access-denied error when reading the S3 dataset. The role also needs a trust policy allowing the Bedrock service to assume it.
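A sketch of the permissions policy such a role might carry, assuming the bucket and prefixes from the example above (exact actions and resources should be checked against the Bedrock documentation for your setup):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/eval/*"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::my-bucket/eval-results/*"]
    }
  ]
}
```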
Reflection Question: You need to choose between Claude 3 Haiku and Claude 3 Sonnet for a document classification task processing 5 million documents monthly. What three pieces of information would you gather before making the decision, and which service on AWS automates the quality comparison?