5.1.3. A/B Testing for Production Models
💡 First Principle: You can't know if a new model version is better than the current one by evaluating it offline alone—real-world traffic patterns, edge cases, and user behavior create conditions that test datasets can't replicate. A/B testing in production provides the ground truth that offline evaluation can only approximate.
SageMaker supports A/B testing through production variants—multiple model versions behind a single endpoint, each receiving a configurable percentage of traffic. This lets you compare a new model (the challenger) against the current model (the champion) on identical live traffic.
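The traffic split is driven by per-variant weights in the endpoint configuration: each variant's share equals its weight divided by the sum of all weights. A minimal sketch of a 90/10 champion/challenger split (model names, instance types, and the config name are hypothetical; the actual `create_endpoint_config` call is left commented because it requires AWS credentials):

```python
# Two production variants behind one endpoint: the champion keeps ~90% of
# traffic, the challenger gets ~10%. All names here are placeholders.
production_variants = [
    {
        "VariantName": "champion",
        "ModelName": "credit-model-v1",   # hypothetical model name
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 2,
        "InitialVariantWeight": 0.9,
    },
    {
        "VariantName": "challenger",
        "ModelName": "credit-model-v2",   # hypothetical model name
        "InstanceType": "ml.m5.xlarge",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 0.1,
    },
]

# Traffic share per variant = its weight / sum of all weights.
total_weight = sum(v["InitialVariantWeight"] for v in production_variants)
traffic_share = {
    v["VariantName"]: v["InitialVariantWeight"] / total_weight
    for v in production_variants
}

# With boto3 (commented out; needs AWS credentials):
# import boto3
# sm = boto3.client("sagemaker")
# sm.create_endpoint_config(
#     EndpointConfigName="ab-test-config",
#     ProductionVariants=production_variants,
# )
```

Because shares are weight ratios, you can rebalance the split later without redeploying: updating the weights to 0.5/0.5 would move the test to an even split on the same endpoint.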
A closely related concept is shadow variants (also called shadow deployments or shadow mode). A shadow variant receives the same traffic as the production model but its predictions are not returned to the caller. Instead, they're logged for offline comparison. This is the safest testing approach because it has zero impact on users—but it doesn't capture how users respond to predictions (e.g., click-through rates), which limits what you can measure.
| Testing Method | Traffic Impact | What You Can Measure | When to Use |
|---|---|---|---|
| A/B Testing (production variants) | Users see different predictions | Accuracy + business metrics (CTR, revenue) | Final validation before full rollout |
| Shadow Testing | Zero user impact | Prediction agreement, latency, errors (no user feedback) | Early validation, high-risk models |
| Canary Deployment | Small % sees new model | Real-time error rate, latency | Gradual rollout with automatic rollback |
⚠️ Exam Trap: Don't confuse A/B testing with canary deployment. A/B testing keeps traffic split for as long as it takes to compare model quality. Canary deployment is a rollout strategy that gradually shifts traffic toward a full cutover and can automatically roll back if errors increase. The exam often presents both as answer choices for model comparison questions.
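The canary logic above can be sketched as a stepped weight schedule with a rollback check. The error-rate callable stands in for a real monitor such as a CloudWatch alarm, and all names are hypothetical; the SageMaker weight-update call is shown in comments:

```python
# Canary rollout sketch: shift traffic to the challenger in steps, rolling
# back to the champion if the observed error rate exceeds a threshold.
CANARY_STEPS = [0.05, 0.25, 0.50, 1.00]   # fraction of traffic to challenger
ERROR_THRESHOLD = 0.02                    # max tolerated error rate

def canary_rollout(get_error_rate):
    """Return the final challenger weight: 1.0 on full rollout,
    0.0 on automatic rollback."""
    for weight in CANARY_STEPS:
        # In SageMaker, each step would update the live split in place:
        # sm.update_endpoint_weights_and_capacities(
        #     EndpointName="credit-endpoint",          # hypothetical name
        #     DesiredWeightsAndCapacities=[
        #         {"VariantName": "challenger", "DesiredWeight": weight},
        #         {"VariantName": "champion", "DesiredWeight": 1 - weight},
        #     ],
        # )
        if get_error_rate() > ERROR_THRESHOLD:
            return 0.0   # rollback: all traffic back to the champion
    return 1.0           # full rollout to the challenger

healthy = canary_rollout(lambda: 0.01)    # stays under threshold -> 1.0
failing = canary_rollout(lambda: 0.05)    # breaches threshold -> 0.0
```

This is the key contrast with A/B testing: the weight schedule is designed to *end* the split (at 1.0 or 0.0), not to hold it steady while metrics accumulate.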
Reflection Question: A financial services company wants to test a new credit scoring model but cannot risk exposing any customers to potentially worse decisions during testing. Which testing approach should they use, and why?