Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.3.1. Detecting and Mitigating Bias in Training Data

💡 First Principle: Bias in ML usually starts with biased data rather than biased algorithms. An algorithm trained on representative, balanced data is far more likely to produce fair results; the same algorithm trained on biased data will reproduce that bias. The exam tests your ability to detect pre-training bias using specific metrics and mitigate it using specific techniques.

Pre-Training Bias Metrics (SageMaker Clarify):

| Metric | What It Measures | Example | Concern If |
| --- | --- | --- | --- |
| Class Imbalance (CI) | Proportion difference between facet values | 80% male, 20% female in training data | CI far from 0 |
| Difference in Proportions of Labels (DPL) | Whether positive labels are distributed equally across facets | 60% of males labeled "hire," 30% of females | DPL far from 0 |
| KL Divergence | How much the label distribution of one facet diverges from that of another | Label distributions differ significantly between groups | Large divergence values |
| Jensen-Shannon Divergence | Symmetric, bounded version of KL divergence | Comparing label distributions between groups | Large divergence values |
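
The metrics in the table can be worked through by hand on its own examples. The following is a minimal sketch, not Clarify's implementation: it assumes the usual definitions CI = (n_a - n_d) / (n_a + n_d) and DPL = q_a - q_d, where facet a is the larger or advantaged group and facet d the other, and it assumes natural logarithms for the divergences. The function names and the 800/200 row counts are illustrative only.

```python
import math

def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); 0 means the two facets are equally represented."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(q_a, q_d):
    """Difference in the proportion of positive labels between facet a and facet d."""
    return q_a - q_d

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Table examples: 80% male / 20% female rows; 60% vs 30% labeled "hire".
ci = class_imbalance(n_a=800, n_d=200)   # 0.6
label_gap = dpl(q_a=0.60, q_d=0.30)      # about 0.3

# Label distributions per facet as [P(hire), P(no hire)].
male_labels = [0.60, 0.40]
female_labels = [0.30, 0.70]
kl = kl_divergence(male_labels, female_labels)
js = js_divergence(male_labels, female_labels)

print(f"CI={ci:.2f} DPL={label_gap:.2f} KL={kl:.3f} JS={js:.3f}")
```

Swapping the arguments changes the KL value but not the JS value, which is the asymmetry the table refers to.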

Mitigation Techniques:

| Technique | How It Works | When to Use | AWS Tool |
| --- | --- | --- | --- |
| Random oversampling | Duplicate minority class samples | Moderate imbalance, sufficient data | Data Wrangler built-in |
| Random undersampling | Remove majority class samples | Large dataset, extreme imbalance | Data Wrangler built-in |
| SMOTE | Generate synthetic minority samples | Moderate imbalance, tabular data | Data Wrangler SMOTE transform |
| Stratified sampling | Maintain class proportions during splitting | All cases (should be default) | SageMaker Processing |
| Reweighting | Assign higher weights to minority class | When resampling is impractical | Algorithm-level (class_weight parameter) |
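
To make the mechanics in the table concrete, here is a minimal standard-library sketch of random oversampling, random undersampling, stratified splitting, and reweighting. It is not how Data Wrangler or SageMaker Processing implement these steps; all function names are hypothetical, and the toy dataset (10 positive rows, 90 negative rows) is made up for illustration. The weight formula n_samples / (n_classes * class_count) is the heuristic scikit-learn applies for `class_weight="balanced"`. SMOTE is omitted because it needs nearest-neighbor interpolation.

```python
import random
from collections import Counter, defaultdict

def group_by_label(rows, label_of):
    """Bucket rows by class label."""
    groups = defaultdict(list)
    for row in rows:
        groups[label_of(row)].append(row)
    return groups

def random_oversample(rows, label_of, seed=0):
    """Duplicate randomly chosen rows until every class matches the largest class."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_of)
    target = max(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(g)
        out.extend(rng.choices(g, k=target - len(g)))
    return out

def random_undersample(rows, label_of, seed=0):
    """Randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    groups = group_by_label(rows, label_of)
    target = min(len(g) for g in groups.values())
    out = []
    for g in groups.values():
        out.extend(rng.sample(g, target))
    return out

def stratified_split(rows, label_of, test_fraction=0.2, seed=0):
    """Split each class separately so train and test keep the class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for g in group_by_label(rows, label_of).values():
        shuffled = g[:]
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * test_fraction)
        test.extend(shuffled[:cut])
        train.extend(shuffled[cut:])
    return train, test

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c); rarer classes get larger weights."""
    counts = Counter(labels)
    return {c: len(labels) / (len(counts) * n) for c, n in counts.items()}

# Toy data (made up): rows are (id, label) with 10 positives and 90 negatives.
rows = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 100)]
label_of = lambda row: row[1]

over = random_oversample(rows, label_of)
under = random_undersample(rows, label_of)
train, test = stratified_split(rows, label_of, test_fraction=0.2)
weights = balanced_class_weights([label_of(r) for r in rows])
```

Resampling should be applied to the training split only; resampling before the split lets duplicated rows leak into the test set.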

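⚠️ Exam Trap: SageMaker Clarify is used for both pre-training bias detection (data analysis) and post-training bias detection (model predictions). When the question asks about bias before training, the relevant metrics are data-level metrics computed on the labels, such as CI and DPL. When the question asks about bias after training, the relevant metrics are model-level metrics computed on the predictions, such as Disparate Impact (DI) and Conditional Demographic Disparity in Predicted Labels (CDDPL). Conditional Demographic Disparity also has a pre-training variant computed on observed labels (CDDL), so check which one the question means. Don't mix the two contexts.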
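
The distinction comes down to which column a metric reads: observed labels before training, model predictions after. The sketch below uses made-up rows and hypothetical helper names. It assumes DPL = q_a - q_d on labels and takes Disparate Impact as the ratio of predicted positive rates, facet d over facet a, which is how the Clarify documentation describes it as far as I am aware; verify against the current docs before relying on it.

```python
def make_rows(facet, n, n_label_pos, n_pred_pos):
    """Build n toy rows for one facet with the given counts of positive labels/predictions."""
    return [
        {"facet": facet, "label": int(i < n_label_pos), "pred": int(i < n_pred_pos)}
        for i in range(n)
    ]

def positive_rate(rows, facet, column):
    """Share of rows in the facet whose value in `column` is 1."""
    group = [r for r in rows if r["facet"] == facet]
    return sum(r[column] for r in group) / len(group)

# Made-up data: facet "a" (advantaged) and facet "d" (disadvantaged), 10 rows each.
rows = make_rows("a", 10, n_label_pos=6, n_pred_pos=7) + \
       make_rows("d", 10, n_label_pos=3, n_pred_pos=2)

# Pre-training: reads the observed labels only, so no model is needed.
pre_training_dpl = positive_rate(rows, "a", "label") - positive_rate(rows, "d", "label")

# Post-training: reads the model's predictions.
post_training_di = positive_rate(rows, "d", "pred") / positive_rate(rows, "a", "pred")

print(f"DPL (labels) = {pre_training_dpl:.2f}, DI (predictions) = {post_training_di:.2f}")
```

Here the model widens the gap already present in the labels: the label-level DPL is about 0.3, while the prediction-level DI falls well below 1.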

Reflection Question: A loan approval dataset has 70% applications from urban areas and 30% from rural areas. The approval rate is 65% for urban and 40% for rural. Which bias metrics would flag this, and what mitigation would you recommend?

Written by Alvin Varughese
Founder, 18 professional certifications