Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.3.1. Detecting and Mitigating Bias in Training Data

💡 First Principle: Bias in ML starts with biased data, not biased algorithms. An algorithm trained on unbiased data produces fair results; the same algorithm trained on biased data produces biased results. The exam tests your ability to detect pre-training bias using specific metrics and mitigate it using specific techniques.

Pre-Training Bias Metrics (SageMaker Clarify):

| Metric | What It Measures | Example | Concern If |
| --- | --- | --- | --- |
| Class Imbalance (CI) | Difference in the proportion of samples between facet values | 80% male, 20% female in training data | CI far from 0 |
| Difference in Proportions of Labels (DPL) | Whether positive labels are distributed equally across facets | 60% of males labeled "hire," 30% of females | DPL far from 0 |
| KL Divergence | How much the label distribution of one facet diverges from that of another | Label distributions differ significantly between groups | Large divergence values |
| Jensen-Shannon Divergence | Symmetric, bounded version of KL divergence | Comparing label distributions between groups | Large divergence values |
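
The metrics in the table can be checked by hand. The sketch below is a minimal illustration using the standard formulas (CI as the normalized difference in facet counts, DPL as the difference in positive-label rates, KL and JS over the label distributions of the two facets); it is not SageMaker Clarify's own code. The function names and the sample count of 1,000 are assumptions made for the example; the 80/20 split and the 60%/30% "hire" rates come from the table.

```python
import math

def class_imbalance(n_a, n_d):
    """CI = (n_a - n_d) / (n_a + n_d); ranges from -1 to 1, 0 means balanced facets."""
    return (n_a - n_d) / (n_a + n_d)

def difference_in_proportions_of_labels(q_a, q_d):
    """DPL = q_a - q_d, where q is the share of positive labels within a facet."""
    return q_a - q_d

def kl_divergence(p, q):
    """KL(p || q) over discrete label distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q against their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical dataset of 1,000 rows: 800 male, 200 female (the 80/20 example).
ci = class_imbalance(800, 200)

# 60% of males and 30% of females are labeled "hire".
dpl = difference_in_proportions_of_labels(0.60, 0.30)

# Label distributions per facet as [P(hire), P(no hire)].
p_male = [0.60, 0.40]
p_female = [0.30, 0.70]
kl = kl_divergence(p_male, p_female)
js = js_divergence(p_male, p_female)

print(f"CI={ci:.2f}  DPL={dpl:.2f}  KL={kl:.4f}  JS={js:.4f}")
```

Here CI is 0.6 and DPL is 0.3, both well away from 0, so both metrics flag the dataset. Note that KL divergence changes if the two facets are swapped, while JS divergence does not.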

Mitigation Techniques:

| Technique | How It Works | When to Use | AWS Tool |
| --- | --- | --- | --- |
| Random oversampling | Duplicate minority class samples | Moderate imbalance, sufficient data | Data Wrangler built-in |
| Random undersampling | Remove majority class samples | Large dataset, extreme imbalance | Data Wrangler built-in |
| SMOTE | Generate synthetic minority samples | Moderate imbalance, tabular data | Data Wrangler SMOTE transform |
| Stratified sampling | Maintain class proportions during splitting | All cases (should be default) | SageMaker Processing |
| Reweighting | Assign higher weights to minority class | When resampling is impractical | Algorithm-level (class_weight parameter) |
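
To make two of these techniques concrete, here is a minimal standard-library sketch of random oversampling and reweighting. It is an illustration only, not what Data Wrangler runs internally. The helper names, the 90/10 toy dataset, and the fixed seed are assumptions for the example; the weight formula `n_samples / (n_classes * class_count)` is the same heuristic scikit-learn applies for `class_weight="balanced"`.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen samples of each smaller class until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = rng.choices(pool, k=target - count)  # sampling with replacement
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count), so rarer classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}

# Hypothetical toy dataset: 90 majority rows (label 0), 10 minority rows (label 1).
labels = [0] * 90 + [1] * 10
rows = [[i] for i in range(100)]

balanced_rows, balanced_labels = random_oversample(rows, labels)
weights = balanced_class_weights(labels)

print(Counter(balanced_labels))  # both classes now have 90 rows
print(weights)                   # the minority class gets the larger weight
```

Oversampling changes the dataset itself, while reweighting leaves the data untouched and changes how much each sample contributes to the loss. In either case, resample only the training split so that duplicated rows do not leak into validation or test data.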

āš ļø Exam Trap: SageMaker Clarify is used for both pre-training bias detection (data analysis) and post-training bias detection (model predictions). When the question asks about bias before training, the relevant metrics are data-level metrics like CI and DPL. When the question asks about bias after training, the relevant metrics are model-level like Disparate Impact and Conditional Demographic Disparity. Don't mix the two contexts.

Reflection Question: A loan approval dataset has 70% applications from urban areas and 30% from rural areas. The approval rate is 65% for urban and 40% for rural. Which bias metrics would flag this, and what mitigation would you recommend?

Written by Alvin Varughese
Founder • 15 professional certifications