Copyright (c) 2026 MindMesh Academy. All rights reserved. This content is proprietary and may not be reproduced or distributed without permission.

2.2.1. Data Cleaning Techniques: Outliers, Missing Data, Deduplication

💡 First Principle: Data cleaning removes noise that would confuse the model. The key insight is that different types of noise require different treatments—dropping an outlier that represents a real edge case (fraud) is fundamentally different from dropping an outlier caused by a sensor error (data quality issue). The exam tests whether you can distinguish between the two.

Outlier Detection and Treatment:

Outliers fall into two categories: informative outliers (real but rare events like high-value fraud transactions) and erroneous outliers (data entry errors, sensor malfunctions). The treatment differs dramatically:

| Outlier Type | Example | Treatment | Danger of Wrong Treatment |
|---|---|---|---|
| Informative | $50,000 fraud transaction | Keep or oversample | Removing eliminates the very pattern you're trying to detect |
| Erroneous | Age = -5 or salary = $999,999,999 | Remove or cap | Keeping teaches the model impossible patterns |
| Ambiguous | Purchase of 100 laptops | Investigate domain context | Blind removal may lose legitimate bulk orders |

Common detection methods include IQR (Interquartile Range)—values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR—and Z-score (values more than 3 standard deviations from the mean). For ML specifically, SageMaker's Random Cut Forest algorithm is designed for anomaly detection at scale.
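The two statistical rules above can be sketched in a few lines of pandas. This is a minimal illustration with a made-up salary column (the $999,999,999 value from the table above); the column name and threshold values are assumptions, not part of any AWS API:

```python
import pandas as pd

# Hypothetical salary column containing one erroneous entry.
s = pd.Series([52_000, 61_000, 58_000, 49_000, 65_000, 999_999_999])

# IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

# Z-score rule: flag values more than 3 standard deviations from the mean.
z = (s - s.mean()) / s.std()
z_outliers = z.abs() > 3

print(s[iqr_outliers].tolist())  # → [999999999]
```

Note that on this tiny sample the z-score rule flags nothing: the extreme value inflates the mean and standard deviation, pulling its own z-score below 3. This sensitivity is a known weakness of z-score detection and a reason the robust IQR rule is often preferred.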

Missing Data Imputation:
| Strategy | When to Use | AWS Tool |
|---|---|---|
| Drop rows | Missingness is random AND <5% of data | Glue/DataBrew transforms |
| Mean/median imputation | Numerical features, random missingness | Data Wrangler built-in transform |
| Mode imputation | Categorical features | Data Wrangler built-in transform |
| Forward/backward fill | Time-series data | Pandas in Glue/EMR job |
| KNN imputation | Values correlated with other features | Custom Spark/scikit-learn in EMR |
| Flag + impute | Missingness itself is informative | Add "is_missing" binary feature + impute |
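Three of these strategies (median imputation, mode imputation, and flag + impute) can be combined in a short pandas sketch. The table below is hypothetical sample data; in practice Data Wrangler's built-in transforms do the equivalent without code:

```python
import numpy as np
import pandas as pd

# Hypothetical customer table with missing numeric and categorical values.
df = pd.DataFrame({
    "income": [54_000, np.nan, 61_000, np.nan, 48_000],
    "segment": ["A", "B", "A", None, "B"],
})

# Flag + impute: capture missingness as a feature BEFORE filling it in,
# in case the fact that income is missing is itself predictive.
df["income_is_missing"] = df["income"].isna().astype(int)

# Median imputation for the numeric column (robust to outliers).
df["income"] = df["income"].fillna(df["income"].median())

# Mode imputation for the categorical column.
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])
```

The ordering matters: compute the `income_is_missing` flag first, because after imputation the information that a value was ever missing is gone.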

Deduplication: Exact duplicates are straightforward—remove them. Fuzzy duplicates (same entity with slight variations like "Jon Smith" vs "John Smith") require record linkage techniques. AWS Glue FindMatches ML transform is specifically built for fuzzy deduplication at scale.
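The exact-vs-fuzzy distinction can be seen in a toy sketch. Exact duplicates fall to `drop_duplicates`; the fuzzy pass below uses Python's stdlib `difflib` similarity as a stand-in for record linkage. This pairwise loop is O(n²) and only for illustration; at scale you would use blocking and a purpose-built tool like Glue FindMatches. The names, columns, and 0.85 threshold are assumptions:

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical records: one exact duplicate, one fuzzy duplicate.
df = pd.DataFrame({
    "name": ["Jon Smith", "John Smith", "Jon Smith", "Maria Lopez"],
    "city": ["Austin", "Austin", "Austin", "Denver"],
})

# Exact duplicates: identical rows are simply dropped.
df = df.drop_duplicates()

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """String similarity via difflib's ratio (2*matches / total length)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Fuzzy duplicates: keep a record only if no already-kept record has a
# near-identical name in the same city.
keep = []
for r in df.to_dict("records"):
    if not any(similar(r["name"], k["name"]) and r["city"] == k["city"]
               for k in keep):
        keep.append(r)

print([r["name"] for r in keep])  # → ['Jon Smith', 'Maria Lopez']
```

"John Smith" scores ≈0.95 against "Jon Smith" and is linked to the same entity, which no exact-match pass could do.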

⚠️ Exam Trap: When a question describes a dataset with 30% missing values in a critical feature, "drop the rows" is almost always wrong—you'd lose too much data. Look for imputation strategies or flag-and-impute patterns. Conversely, when missingness is <1% and random, elaborate imputation is over-engineering.

Reflection Question: A dataset for predicting customer churn has a "last_login_date" column that's NULL for 15% of records. These NULLs all correspond to customers who never logged in after creating an account. Should you impute or is the missingness itself a feature?

Written by Alvin Varughese, Founder — 15 professional certifications