3.3.1. Evaluation Metrics: Confusion Matrix, F1, RMSE, ROC-AUC
💡 First Principle: Classification metrics and regression metrics are fundamentally different because they measure different things. Classification asks "did you get the category right?" Regression asks "how far off was the number?" Never apply a classification metric to a regression problem or vice versa.
Classification Metrics (from the Confusion Matrix):
| Metric | Formula | Optimizes For | Use When |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | Overall correctness | Balanced classes |
| Precision | TP / (TP + FP) | Minimizing false positives | Cost of false positive is high (spam filter, fraud alert) |
| Recall (Sensitivity) | TP / (TP + FN) | Minimizing false negatives | Cost of missing positives is high (cancer detection, fraud) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Balance of precision and recall | Imbalanced classes, need both |
| ROC-AUC | Area under ROC curve | Overall discriminative ability | Comparing models, threshold-independent evaluation |
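The table's formulas map directly onto scikit-learn's metric functions. A minimal sketch with illustrative labels and scores (not exam data):

```python
# Computing the classification metrics above with scikit-learn.
# y_true/y_pred/y_score are toy values chosen for illustration.
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score,
    recall_score, f1_score, roc_auc_score,
)

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]   # ground-truth labels
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]   # hard (thresholded) predictions
y_score = [0.1, 0.2, 0.6, 0.3, 0.9, 0.8, 0.4, 0.2, 0.7, 0.1]  # probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / Total
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))         # harmonic mean of P and R
print("ROC-AUC  :", roc_auc_score(y_true, y_score))   # needs scores, not hard labels
```

Note that ROC-AUC is computed from the continuous scores rather than the thresholded predictions, which is what makes it threshold-independent.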
Regression Metrics:
| Metric | What It Measures | Sensitive To | Use When |
|---|---|---|---|
| RMSE | Root mean squared error | Large errors (penalizes outliers) | Outlier errors are costly |
| MAE | Mean absolute error | Average error magnitude | Robustness to outliers is needed |
| R² | Proportion of variance explained | Model fit quality | Comparing models on same data |
| MAPE | Mean absolute percentage error | Relative error | Errors should be proportional to magnitude |
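The regression metrics can be computed the same way; the toy values below are chosen to show how one large error (the last point) inflates RMSE more than MAE:

```python
# Computing the regression metrics above with scikit-learn and NumPy.
import numpy as np
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error,
    r2_score, mean_absolute_percentage_error,
)

y_true = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0, 330.0])  # last error is 3x the others

rmse = np.sqrt(mean_squared_error(y_true, y_pred))       # penalizes the 30-unit miss
mae = mean_absolute_error(y_true, y_pred)                # treats all errors linearly
r2 = r2_score(y_true, y_pred)                            # variance explained
mape = mean_absolute_percentage_error(y_true, y_pred)    # returns a fraction, not a percent

print(f"RMSE={rmse:.2f} MAE={mae:.2f} R2={r2:.3f} MAPE={mape:.1%}")
```

Here RMSE (~16.1) exceeds MAE (14.0) precisely because squaring amplifies the single 30-unit error, which is the "sensitive to large errors" property in the table.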
Shadow Variants for Production Comparison: SageMaker supports deploying a new model version as a "shadow variant" that receives a copy of production traffic but doesn't serve responses to users. You compare the shadow variant's predictions against the production model to validate performance before full deployment.
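A shadow variant is declared alongside the production variant in the endpoint configuration. A hedged sketch of the request shape, where the endpoint-config name, model names, and instance type are all hypothetical placeholders:

```python
# Sketch of a SageMaker endpoint configuration with a shadow variant.
# All names below (endpoint config, models, instance type) are hypothetical;
# ShadowProductionVariants is the CreateEndpointConfig field for shadow testing.
endpoint_config = {
    "EndpointConfigName": "fraud-model-with-shadow",   # hypothetical name
    "ProductionVariants": [{
        "VariantName": "production",
        "ModelName": "fraud-model-v1",                 # hypothetical model
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    }],
    "ShadowProductionVariants": [{                     # receives a copy of traffic;
        "VariantName": "shadow",                       # its responses are logged,
        "ModelName": "fraud-model-v2",                 # never returned to callers
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    }],
}

# With boto3 and configured AWS credentials, this would be submitted as:
# import boto3
# boto3.client("sagemaker").create_endpoint_config(**endpoint_config)
```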
⚠️ Exam Trap: When a question says "imbalanced dataset" and asks which metric to use, accuracy is almost always the wrong answer. On a dataset with 95% negative class, a model predicting all negatives gets 95% accuracy. Use F1, precision, recall, or AUC instead. The exam frequently uses this pattern.
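The trap is easy to reproduce: a do-nothing model that always predicts the majority class scores high on accuracy while recall and F1 expose it.

```python
# Demonstrating the accuracy trap on a 95%-negative dataset:
# a model that always predicts "negative" looks great on accuracy alone.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0] * 95 + [1] * 5   # 95% negative class, 5% positive
y_pred = [0] * 100            # model predicts all negatives

print("Accuracy:", accuracy_score(y_true, y_pred))                  # 0.95
print("Recall  :", recall_score(y_true, y_pred, zero_division=0))   # 0.0
print("F1      :", f1_score(y_true, y_pred, zero_division=0))       # 0.0
```

With zero true positives, recall and F1 both collapse to 0, immediately revealing that the model never catches a positive case despite its 95% accuracy.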
Reflection Question: A credit card company wants to detect fraudulent transactions. Only 0.1% of transactions are fraudulent. A model predicts all transactions as legitimate and achieves 99.9% accuracy. Explain why this model is useless and which metric would reveal the problem.