Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

3.3.1. Evaluation Metrics: Confusion Matrix, F1, RMSE, ROC-AUC

💡 First Principle: Classification metrics and regression metrics are fundamentally different because they measure different things. Classification asks "did you get the category right?" Regression asks "how far off was the number?" Never apply a classification metric to a regression problem or vice versa.

Classification Metrics (from the Confusion Matrix):

| Metric | Formula | Optimizes For | Use When |
|---|---|---|---|
| Accuracy | (TP + TN) / Total | Overall correctness | Balanced classes |
| Precision | TP / (TP + FP) | Minimizing false positives | Cost of a false positive is high (spam filter, fraud alert) |
| Recall (Sensitivity) | TP / (TP + FN) | Minimizing false negatives | Cost of missing positives is high (cancer detection, fraud) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) | Balance of precision and recall | Imbalanced classes, need both |
| AUC-ROC | Area under ROC curve | Overall discriminative ability | Comparing models, threshold-independent evaluation |
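The table's formulas can be checked by hand. The sketch below computes the confusion-matrix counts and the derived metrics in plain Python; the `y_true`/`y_pred` labels are made-up illustration data, not from any real model.

```python
# Compute confusion-matrix counts and derived metrics for a binary
# classifier (labels: 1 = positive, 0 = negative).

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # illustration data
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)   # (3, 1, 3, 1)
accuracy  = (tp + tn) / len(y_true)                 # 0.75
precision = tp / (tp + fp)                          # 0.75
recall    = tp / (tp + fn)                          # 0.75
f1        = 2 * precision * recall / (precision + recall)  # 0.75
```

In practice you would use `sklearn.metrics` (`confusion_matrix`, `f1_score`, `roc_auc_score`) rather than hand-rolling these, but the arithmetic above is exactly what those functions compute.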
Regression Metrics:

| Metric | What It Measures | Sensitive To | Use When |
|---|---|---|---|
| RMSE | Root mean squared error | Large errors (penalizes outliers) | Outlier errors are costly |
| MAE | Mean absolute error | Average error magnitude | Want robustness to outliers |
| R² | Proportion of variance explained | Model fit quality | Comparing models on the same data |
| MAPE | Mean absolute percentage error | Relative error | Errors should be proportional to magnitude |
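The four regression metrics follow directly from their definitions. A minimal sketch, using made-up `y_true`/`y_pred` values purely for illustration:

```python
# Compute RMSE, MAE, R², and MAPE by hand for a small regression example.
import math

y_true = [3.0, 5.0, 2.0, 7.0]   # illustration data
y_pred = [2.5, 5.0, 4.0, 8.0]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

rmse = math.sqrt(sum(e * e for e in errors) / n)      # penalizes large errors
mae  = sum(abs(e) for e in errors) / n                # average error magnitude

mean_true = sum(y_true) / n
ss_res = sum(e * e for e in errors)                   # residual sum of squares
ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total sum of squares
r2 = 1 - ss_res / ss_tot                              # variance explained

mape = sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n
```

Note how the single large error (2.0 vs 4.0) pushes RMSE (≈1.15) above MAE (0.875): squaring before averaging is exactly what makes RMSE outlier-sensitive.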

Shadow Variants for Production Comparison: SageMaker supports deploying a new model version as a "shadow variant" that receives a copy of production traffic but doesn't serve responses to users. You compare the shadow variant's predictions against the production model to validate performance before full deployment.

⚠️ Exam Trap: When a question says "imbalanced dataset" and asks which metric to use, accuracy is almost always the wrong answer. On a dataset with 95% negative class, a model predicting all negatives gets 95% accuracy. Use F1, precision, recall, or AUC instead. The exam frequently uses this pattern.
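The trap above is easy to demonstrate with synthetic data: a "model" that predicts all negatives on a 95%-negative dataset scores 95% accuracy while catching zero positives.

```python
# Demonstrate why accuracy misleads on imbalanced data: an all-negative
# predictor on a dataset that is 95% negative class (synthetic data).

y_true = [1] * 5 + [0] * 95     # 5% positive class
y_pred = [0] * 100              # model predicts "negative" for everything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # 0.95

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # 0
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # 5
recall = tp / (tp + fn)                                      # 0.0
# With zero true positives, precision is undefined and F1 is 0:
# recall (or F1) exposes the useless model that accuracy hides.
```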

Reflection Question: A credit card company wants to detect fraudulent transactions. Only 0.1% of transactions are fraudulent. A model predicts all transactions as legitimate and achieves 99.9% accuracy. Explain why this model is useless and which metric would reveal the problem.

Written by Alvin Varughese, Founder · 15 professional certifications