Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.1.1. Model Selection Criteria and Benchmarking

💡 First Principle: Model selection must be empirical, not intuitive — run your actual representative queries against candidate models, measure against your specific quality criteria, and let the data decide. Benchmarks from research papers measure academic tasks, not your business problem.

The evaluation framework for model selection:
| Criterion | What to Measure | How to Measure |
|---|---|---|
| Task quality | Does output meet the minimum acceptable threshold? | Bedrock Model Evaluations; human evaluation on a golden dataset |
| Latency | P50/P99 response time under your load pattern | CloudWatch, X-Ray tracing with a load test |
| Context window | Maximum tokens for your longest expected inputs | Check the model spec; test with actual long documents |
| Cost per query | Input tokens × input price + output tokens × output price | Calculate from the Bedrock pricing page for candidate models |
| Modality support | Text only, or image/audio/document? | Model card in the Bedrock console |
| Regional availability | Available in your required AWS Region? | Bedrock console regional model list |
| Fine-tuning support | Can it be customized via Bedrock fine-tuning? | Bedrock documentation; not all models support it |
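The cost-per-query row above is worth making concrete before committing to a model. A minimal sketch of the calculation, using per-1K-token prices that match the Claude 3 launch pricing at the time of writing (verify current numbers on the Bedrock pricing page; token counts per query are illustrative):

```python
def monthly_cost(queries, in_tokens, out_tokens, in_price_per_1k, out_price_per_1k):
    """Estimated monthly spend: per-query token cost scaled by query volume."""
    per_query = (in_tokens / 1000) * in_price_per_1k + (out_tokens / 1000) * out_price_per_1k
    return queries * per_query

# Illustrative workload: 5M queries/month, ~2,000 input tokens, ~150 output tokens.
haiku_cost = monthly_cost(5_000_000, 2_000, 150, 0.00025, 0.00125)
sonnet_cost = monthly_cost(5_000_000, 2_000, 150, 0.003, 0.015)
print(f"Haiku ~${haiku_cost:,.0f}/mo vs Sonnet ~${sonnet_cost:,.0f}/mo")
```

At this volume the price gap compounds quickly, which is why cost per query belongs in the evaluation matrix rather than being an afterthought.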

Implementing model evaluation with Bedrock Model Evaluations: Bedrock provides managed evaluation jobs that run your prompt dataset against a candidate model and score the outputs. Automated jobs compute built-in metrics such as accuracy, robustness, and toxicity; human evaluation jobs add subjective criteria such as helpfulness, faithfulness, and coherence. For head-to-head model comparison, this is the production-grade approach, not manual spot-checking.
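The evaluation job reads a golden dataset from S3 in JSON Lines format, one record per line. A minimal sketch of building that file, assuming the `prompt`/`referenceResponse` key names described in the Bedrock documentation (verify the current schema before running a job; the Q&A pairs here are placeholders):

```python
import json

# Hypothetical golden Q&A pairs for a question-answering evaluation dataset.
golden_pairs = [
    ("Which AWS service runs managed model evaluation jobs?", "Amazon Bedrock"),
    ("Where do evaluation results get written?", "An S3 output location you specify"),
]

# One JSON object per line: the JSONL file the evaluation job will read.
jsonl = "\n".join(
    json.dumps({"prompt": prompt, "referenceResponse": reference})
    for prompt, reference in golden_pairs
)

# Upload `jsonl` to the dataset S3 prefix (e.g. with s3.put_object)
# before creating the evaluation job that references it.
print(jsonl)
```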

```python
# Example: trigger a Bedrock automated model evaluation job.
# Note: automated jobs score one model per job; to compare candidates,
# run one job per model and compare the results side by side.
import boto3

bedrock = boto3.client('bedrock', region_name='us-east-1')

response = bedrock.create_evaluation_job(
    jobName='model-selection-eval-q4-haiku',
    evaluationConfig={
        'automated': {
            'datasetMetricConfigs': [{
                'taskType': 'QuestionAndAnswer',
                'dataset': {
                    'name': 'golden-qa-dataset',
                    'datasetLocation': {'s3Uri': 's3://my-bucket/eval/golden-qa.jsonl'}
                },
                # Built-in automated metric names carry the "Builtin." prefix.
                'metricNames': ['Builtin.Accuracy', 'Builtin.Robustness', 'Builtin.Toxicity']
            }]
        }
    },
    inferenceConfig={
        'models': [
            {'bedrockModel': {'modelIdentifier': 'anthropic.claude-3-haiku-20240307-v1:0'}}
        ]
    },
    outputDataConfig={'s3Uri': 's3://my-bucket/eval-results/'},
    roleArn='arn:aws:iam::123456789012:role/BedrockEvaluationRole'
)
print(response['jobArn'])
```

⚠️ Exam Trap: Bedrock Model Evaluations requires an IAM role with specific permissions to read your evaluation dataset from S3 and write results back. A common failure mode in exam scenarios is missing IAM permissions on the evaluation job role — the job fails silently or with an access denied error on the S3 dataset.
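A minimal sketch of the role configuration the trap above refers to, expressed as policy documents. The bucket names and statement scope are illustrative, and the full required policy also includes model-invocation permissions; consult the Bedrock documentation for the complete service-role policy:

```python
import json

# Trust policy: allows the Bedrock service to assume the evaluation role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "bedrock.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy: read the evaluation dataset, write results back.
s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::my-bucket", "arn:aws:s3:::my-bucket/eval/*"],
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::my-bucket/eval-results/*"],
        },
    ],
}

print(json.dumps(s3_policy, indent=2))
```

If either half is missing (the trust relationship or the S3 statements), the job cannot read the dataset or persist results, which is exactly the failure mode described above.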

Reflection Question: You need to choose between Claude 3 Haiku and Claude 3 Sonnet for a document classification task processing 5 million documents monthly. What three pieces of information would you gather before making the decision, and which service on AWS automates the quality comparison?

Written by Alvin Varughese, Founder. 15 professional certifications.