5.2.2. Validation Criteria for Custom AI Models
💡 First Principle: Model validation is the gatekeeper between development and production. A model that performs well on training data may fail on production data if the validation criteria don't test for the right things: accuracy on representative data, bias across demographics, stability over time, and behavior at the boundaries.
Validation Criteria Framework:
| Criterion | What It Evaluates | Measurement |
|---|---|---|
| Accuracy | Does the model produce correct outputs? | Precision, recall, F1 score on held-out test set |
| Bias | Does the model perform equally across demographics? | Accuracy disaggregated by protected attributes |
| Robustness | Does the model handle edge cases and adversarial inputs? | Performance on out-of-distribution and adversarial test sets |
| Calibration | Do the model's confidence scores reflect actual accuracy? | Expected calibration error (ECE) |
| Latency | Does inference meet performance requirements? | P50, P95, P99 latency measurements |
| Drift tolerance | How much can input distribution change before accuracy degrades? | Performance on synthetically shifted test data |
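Most of these criteria map to standard metrics, but calibration is the one teams most often skip. The expected calibration error listed above can be sketched in a few lines: bin predictions by confidence and compare each bin's average confidence to its empirical accuracy. This is a minimal illustration; the bin count and toy data are assumptions, not a standard.

```python
# Minimal expected calibration error (ECE) sketch.
# Bins predictions by confidence, then takes the weighted average of
# |empirical accuracy - average confidence| across bins.

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over confidence bins."""
    total = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        in_bin = [(c, ok) for c, ok in zip(confidences, correct)
                  if lo < c <= hi or (b == 0 and c == 0.0)]
        if not in_bin:
            continue
        avg_conf = sum(c for c, _ in in_bin) / len(in_bin)
        accuracy = sum(ok for _, ok in in_bin) / len(in_bin)
        ece += (len(in_bin) / total) * abs(accuracy - avg_conf)
    return ece

# Toy example of perfect calibration: 80% confidence, 80% correct.
confs = [0.8] * 10
correct = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
print(round(expected_calibration_error(confs, correct), 4))  # → 0.0
```

A model can be highly accurate and still badly calibrated (e.g., 99% confident on inputs it gets right only 70% of the time), which matters whenever downstream logic thresholds on the confidence score.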
Training Data Quality Validation:
Before validating the model, validate its training data. Training data must be representative (covers the distribution the model will encounter in production), balanced (no demographic or categorical skew unless intentionally designed), clean (minimal label noise), and properly split (no data leakage between train/validation/test sets).
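The "properly split" requirement is the easiest to automate. A leakage check can be sketched by hashing each record's content and verifying no hash appears in both splits; the record fields and invoice examples below are illustrative assumptions.

```python
# Sketch of a train/test leakage check: hash each record's content and
# report test records whose content also appears in the training split.
import hashlib

def record_hash(record: dict) -> str:
    """Stable content hash of a record (keys sorted for determinism)."""
    canonical = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(canonical.encode()).hexdigest()

def find_leakage(train, test):
    """Return test records that appear (by content) in the training set."""
    train_hashes = {record_hash(r) for r in train}
    return [r for r in test if record_hash(r) in train_hashes]

train = [{"text": "invoice 001", "label": "utilities"},
         {"text": "invoice 002", "label": "travel"}]
test = [{"text": "invoice 002", "label": "travel"},   # duplicate -> leak
        {"text": "invoice 003", "label": "supplies"}]

print(len(find_leakage(train, test)))  # → 1
```

Content hashing catches exact duplicates only; near-duplicates (re-scanned or lightly edited documents) need fuzzy matching, but an exact-match check is a cheap first gate.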
Benchmark Testing:
Compare model performance against established baselines — either a previous model version, a competing model, or human performance. The comparison should include both aggregate metrics and per-category breakdowns to catch scenarios where aggregate performance is acceptable but specific categories are failing.
⚠️ Exam Trap: A scenario describes a model with 94% accuracy that the team wants to deploy. A distractor approves deployment based on accuracy alone. The correct answer asks for accuracy disaggregated by category — the model might be 99% accurate on common categories and 60% accurate on rare but critical ones.
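The disaggregation this trap asks for is straightforward to compute. A sketch, with illustrative numbers chosen to show a strong aggregate hiding a weak rare category:

```python
# Sketch: disaggregate accuracy by category so a strong aggregate
# number cannot hide a failing rare category. Data is illustrative.
from collections import defaultdict

def accuracy_by_category(records):
    """records: list of (category, predicted_ok) pairs -> per-category accuracy."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for category, ok in records:
        totals[category] += 1
        hits[category] += int(ok)
    return {c: hits[c] / totals[c] for c in totals}

# 95 common records at ~99% accuracy mask 5 rare records at 40%.
records = [("common", i != 0) for i in range(95)] + \
          [("rare", i < 2) for i in range(5)]

overall = sum(ok for _, ok in records) / len(records)
per_cat = accuracy_by_category(records)
print(round(overall, 2), {c: round(a, 2) for c, a in per_cat.items()})
# → 0.96 {'common': 0.99, 'rare': 0.4}
```

The 96% aggregate would pass a naive accuracy gate while the rare category fails outright, which is exactly the pattern the exam trap describes.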
Troubleshooting Scenario: A custom AI model for invoice classification achieves 97% accuracy in validation but only 82% in production. Investigation reveals the training data was 90% digital invoices from automated systems, but 35% of production invoices are photographed paper documents with stamps, handwriting, and coffee stains. This is a classic data distribution mismatch — the model was never exposed to the variety it faces in production. The validation process should have included: (1) training data representativeness audit against production data profiles, (2) stratified validation across data subtypes, and (3) a specific edge-case benchmark for underrepresented categories.
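The representativeness audit in step (1) can be sketched as a comparison of category proportions between the training data and a production sample. The 10-percentage-point threshold and category labels below are illustrative assumptions, not fixed rules.

```python
# Sketch of a representativeness audit: compare category proportions
# between training data and a production sample, flagging large gaps.
from collections import Counter

def proportion_gaps(train_labels, prod_labels, threshold=0.10):
    """Return categories whose train/production share differs by more
    than `threshold` (as an absolute proportion)."""
    train_counts = Counter(train_labels)
    prod_counts = Counter(prod_labels)
    n_train, n_prod = len(train_labels), len(prod_labels)
    gaps = {}
    for cat in set(train_counts) | set(prod_counts):
        gap = abs(train_counts[cat] / n_train - prod_counts[cat] / n_prod)
        if gap > threshold:
            gaps[cat] = round(gap, 2)
    return gaps

# Mirrors the scenario: 90% digital invoices in training vs 65% in
# production (35% photographed paper documents).
train = ["digital"] * 90 + ["photo"] * 10
prod = ["digital"] * 65 + ["photo"] * 35

print(sorted(proportion_gaps(train, prod).items()))
# → [('digital', 0.25), ('photo', 0.25)]
```

Run against a periodically refreshed production sample, a check like this would have flagged the photographed-document gap before the 97%-to-82% accuracy drop surfaced in production.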
⚠️ Exam Trap: High validation accuracy doesn't guarantee production quality. Always check whether the validation data represents the full distribution of production inputs — not just the clean, common cases.
Reflection Question: A custom document classification model achieves 92% F1 score on the test set. Before approving production deployment, what additional validation criteria should the architect require?