Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

7.1.3. A/B Testing and Incremental Rollout

💡 First Principle: A/B testing for FM applications tests whether a change (new model, new prompt, new retrieval parameters) improves measured outcomes, not just whether the new version works. The control group (existing behavior) is the baseline — the question is whether the treatment group (new behavior) is statistically significantly better or worse.

A/B testing architecture with Lambda aliases:
import boto3

lambda_client = boto3.client('lambda')

# Lambda alias routes traffic: 90% to control (v1), 10% to treatment (v2).
# Weighted aliases must point at published versions, not $LATEST.
lambda_client.update_alias(
    FunctionName='genai-query-handler',
    Name='production',
    FunctionVersion='1',  # control version receives the remaining 90%
    RoutingConfig={'AdditionalVersionWeights': {'2': 0.10}}  # 10% to version 2
)

# Each version logs which variant handled the request. The executed version
# is available on the Lambda context object, so no separate feature flag is needed.
import boto3

cloudwatch = boto3.client('cloudwatch')

def lambda_handler(event, context):
    variant = 'treatment' if context.function_version == '2' else 'control'
    response = invoke_bedrock(event['query'])

    # Log variant and outcome under a per-variant dimension for A/B analysis
    cloudwatch.put_metric_data(
        Namespace='GenAI/ABTest',
        MetricData=[
            {'MetricName': 'UserSatisfactionScore',
             'Value': response.get('user_score', 0),
             'Dimensions': [{'Name': 'Variant', 'Value': variant}]},
            {'MetricName': 'ResponseLatencyMs',
             'Value': response['latency_ms'],
             'Dimensions': [{'Name': 'Variant', 'Value': variant}]}
        ]
    )
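For offline analysis, the per-variant series can be read back with CloudWatch `GetMetricData`. A sketch of the query payload (the `Id` values and the `variant_mean_query` helper are illustrative, not part of the AWS API):

```python
def variant_mean_query(metric_name, variant, period=3600):
    """One GetMetricData query: hourly mean of a metric for one variant."""
    return {
        'Id': f"{variant}_mean",
        'MetricStat': {
            'Metric': {
                'Namespace': 'GenAI/ABTest',
                'MetricName': metric_name,
                'Dimensions': [{'Name': 'Variant', 'Value': variant}],
            },
            'Period': period,
            'Stat': 'Average',
        },
    }

queries = [variant_mean_query('UserSatisfactionScore', v)
           for v in ('control', 'treatment')]
# cloudwatch.get_metric_data(MetricDataQueries=queries, StartTime=..., EndTime=...)
```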

Statistical significance in FM A/B tests: draw conclusions only after collecting a sufficient sample. FMs can perform differently across query types, so an early sample that over-represents one query category can produce a false positive:

import numpy as np
from scipy import stats

def evaluate_ab_test(control_scores, treatment_scores, alpha=0.05):
    """Welch's two-sample t-test for A/B significance (does not assume equal variances)."""
    t_stat, p_value = stats.ttest_ind(control_scores, treatment_scores, equal_var=False)
    
    control_mean = np.mean(control_scores)
    treatment_mean = np.mean(treatment_scores)
    relative_improvement = (treatment_mean - control_mean) / control_mean
    
    result = {
        'significant': p_value < alpha,
        'p_value': p_value,
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'relative_improvement': f"{relative_improvement:.1%}",
        'recommendation': 'SHIP' if (p_value < alpha and treatment_mean > control_mean) else 'HOLD'
    }
    return result
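A quick synthetic run illustrates the behavior, assuming roughly normal 1–5 satisfaction scores (the means, spread, and sample sizes here are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated 1-5 satisfaction scores: treatment has a small +0.1 mean lift.
control_scores = np.clip(rng.normal(3.5, 1.0, 5000), 1, 5)
treatment_scores = np.clip(rng.normal(3.6, 1.0, 5000), 1, 5)

t_stat, p_value = stats.ttest_ind(control_scores, treatment_scores, equal_var=False)
lift = (treatment_scores.mean() - control_scores.mean()) / control_scores.mean()
print(f"p={p_value:.4f}, lift={lift:.1%}")  # small lifts need thousands of samples to detect
```

With only a few hundred samples per variant, the same 0.1-point lift would frequently fail to reach significance, which is exactly the trap described below.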

⚠️ Exam Trap: A/B testing FM applications requires a much larger sample size than traditional software A/B tests because FM output quality has high variance — two invocations of the same prompt can produce measurably different quality responses. A typical software A/B test might require 1,000 samples; FM quality A/B tests often require 5,000–10,000+ samples per variant for statistical power.
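The sample-size claim can be sanity-checked with a standard two-sample power calculation. A sketch, assuming score standard deviation around 1.0 and a minimum detectable lift of 0.05 points (both figures are illustrative):

```python
import math
from scipy.stats import norm

def samples_per_variant(delta, sigma, alpha=0.05, power=0.80):
    """Samples per group to detect a mean difference `delta` given noise `sigma`."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = norm.ppf(power)           # desired statistical power
    return math.ceil(2 * ((z_alpha + z_beta) ** 2) * sigma**2 / delta**2)

n = samples_per_variant(delta=0.05, sigma=1.0)
print(n)  # roughly 6,300 per variant under these assumptions
```

With these assumptions the calculation lands at roughly 6,300 samples per variant, consistent with the 5,000–10,000+ range above; the high variance of FM quality scores (large `sigma`) is what inflates `n`.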

Reflection Question: You want to test whether switching from Claude 3 Haiku to Claude 3 Sonnet for customer support queries improves user satisfaction scores. You have 50,000 queries per day. How would you structure the A/B test, what metrics would you collect, and how long would you run it before making a decision?
