7.1.3. A/B Testing and Incremental Rollout
💡 First Principle: A/B testing for FM applications tests whether a change (new model, new prompt, new retrieval parameters) improves measured outcomes, not just whether the new version works. The control group (existing behavior) is the baseline; the question is whether the treatment group (new behavior) performs better or worse by a statistically significant margin.
A/B testing architecture with Lambda aliases:
# Lambda alias routes traffic: 90% to control (v1), 10% to treatment (v2)
import boto3

lambda_client = boto3.client('lambda')
lambda_client.update_alias(
    FunctionName='genai-query-handler',
    Name='production',
    FunctionVersion='1',  # control; weighted routing requires a published version, not $LATEST
    RoutingConfig={'AdditionalVersionWeights': {'2': 0.10}}  # 10% of traffic to version 2
)
# Each version logs which variant handled the request
import boto3

cloudwatch = boto3.client('cloudwatch')
TREATMENT_VERSION = '2'

def lambda_handler(event, context):
    # context.function_version identifies which published version is executing
    variant = 'treatment' if context.function_version == TREATMENT_VERSION else 'control'
    response = invoke_bedrock(event['query'])
    # Log variant and outcome for A/B analysis
    cloudwatch.put_metric_data(
        Namespace='GenAI/ABTest',
        MetricData=[
            {'MetricName': 'UserSatisfactionScore',
             'Value': response.get('user_score', 0),
             'Dimensions': [{'Name': 'Variant', 'Value': variant}]},
            {'MetricName': 'ResponseLatencyMs',
             'Value': response['latency_ms'],
             'Dimensions': [{'Name': 'Variant', 'Value': variant}]}
        ]
    )
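To analyze the test, the per-variant scores must be read back out of CloudWatch. A hedged sketch of retrieving hourly averages for one variant (the function name is hypothetical; namespace, metric, and dimension names match the put_metric_data call above, and the client is passed in rather than constructed):

```python
from datetime import datetime, timedelta

def fetch_variant_scores(cloudwatch, variant, hours=24,
                         metric='UserSatisfactionScore'):
    """Pull hourly average scores for one variant from CloudWatch."""
    now = datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace='GenAI/ABTest',
        MetricName=metric,
        Dimensions=[{'Name': 'Variant', 'Value': variant}],
        StartTime=now - timedelta(hours=hours),
        EndTime=now,
        Period=3600,            # one datapoint per hour
        Statistics=['Average']
    )
    # Datapoints are returned unordered; sort by timestamp
    points = sorted(resp['Datapoints'], key=lambda d: d['Timestamp'])
    return [p['Average'] for p in points]
```

The resulting lists for 'control' and 'treatment' feed directly into a significance test.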
Statistical significance in FM A/B tests: A/B tests need sufficient sample sizes before drawing conclusions. FMs can perform differently for different query types — an early sample that over-represents one query category may show false positives:
from scipy import stats
import numpy as np

def evaluate_ab_test(control_scores, treatment_scores, alpha=0.05):
    """Two-sample t-test for A/B significance."""
    t_stat, p_value = stats.ttest_ind(control_scores, treatment_scores)
    control_mean = np.mean(control_scores)
    treatment_mean = np.mean(treatment_scores)
    relative_improvement = (treatment_mean - control_mean) / control_mean
    return {
        'significant': p_value < alpha,
        'p_value': p_value,
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'relative_improvement': f"{relative_improvement:.1%}",
        'recommendation': 'SHIP' if (p_value < alpha and treatment_mean > control_mean) else 'HOLD'
    }
⚠️ Exam Trap: A/B testing FM applications requires a much larger sample size than traditional software A/B tests because FM output quality has high variance — two invocations of the same prompt can produce measurably different quality responses. A typical software A/B test might require 1,000 samples; FM quality A/B tests often require 5,000–10,000+ samples per variant for statistical power.
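The sample sizes in the trap above can be sanity-checked with a standard power calculation. A minimal sketch using the normal approximation for a two-sample test (the function name and effect sizes are illustrative, not from the source):

```python
from scipy import stats

def required_sample_size(effect_size, alpha=0.05, power=0.80):
    """Per-variant n via the normal approximation:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2,
    where d = (mean difference) / std dev (Cohen's d)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    return int(2 * ((z_alpha + z_beta) / effect_size) ** 2) + 1

# High variance in FM quality scores shrinks the standardized effect:
# a small effect needs an order of magnitude more samples per variant
print(required_sample_size(0.2))   # moderate effect: hundreds per variant
print(required_sample_size(0.05))  # small effect: thousands per variant
```

The same noisy scores that make FM outputs hard to judge one at a time are what drive the 5,000–10,000+ sample requirement.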
Reflection Question: You want to test whether switching from Claude 3 Haiku to Claude 3 Sonnet for customer support queries improves user satisfaction scores. You have 50,000 queries per day. How would you structure the A/B test, what metrics would you collect, and how long would you run it before making a decision?