2.2.1. Data Cleaning Techniques: Outliers, Missing Data, Deduplication
💡 First Principle: Data cleaning removes noise that would confuse the model. The key insight is that different types of noise require different treatments—dropping an outlier that represents a real edge case (fraud) is fundamentally different from dropping an outlier caused by a sensor error (data quality issue). The exam tests whether you can distinguish between the two.
Outlier Detection and Treatment:
Outliers fall into two categories: informative outliers (real but rare events like high-value fraud transactions) and erroneous outliers (data entry errors, sensor malfunctions). The treatment differs dramatically:
| Outlier Type | Example | Treatment | Danger of Wrong Treatment |
|---|---|---|---|
| Informative | $50,000 fraud transaction | Keep or oversample | Removing eliminates the very pattern you're trying to detect |
| Erroneous | Age = -5 or salary = $999,999,999 | Remove or cap | Keeping teaches the model impossible patterns |
| Ambiguous | Purchase of 100 laptops | Investigate domain context | Blind removal may lose legitimate bulk orders |
Common detection methods include IQR (Interquartile Range)—flagging values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR—and Z-score (flagging values more than 3 standard deviations from the mean). Note that the Z-score method assumes roughly normal data, and the mean and standard deviation it relies on are themselves distorted by extreme values, which makes the IQR method more robust. For ML specifically, SageMaker's Random Cut Forest algorithm is designed for anomaly detection at scale.
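A minimal sketch of both statistical methods, using a made-up array of transaction amounts with one erroneous entry (the data and threshold choices are illustrative, not from AWS):

```python
import numpy as np

# Hypothetical transaction amounts: 30 normal values plus one erroneous entry
# (e.g., a data-entry error like the $999,999,999 salary in the table above).
values = np.append(np.tile([90.0, 95.0, 100.0, 105.0, 110.0, 115.0], 5), 999_999.0)

# IQR method: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# Z-score method: flag values more than 3 standard deviations from the mean.
z = (values - values.mean()) / values.std()
z_outliers = np.abs(z) > 3

print(values[iqr_outliers])  # only the erroneous entry is flagged
```

Whether the flagged value is then removed, capped, or kept depends on the informative-vs-erroneous distinction above, not on the detection method.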
Missing Data Imputation:
| Strategy | When to Use | AWS Tool |
|---|---|---|
| Drop rows | Missingness is random AND <5% of data | Glue/DataBrew transforms |
| Mean/median imputation | Numerical features, random missingness | Data Wrangler built-in transform |
| Mode imputation | Categorical features | Data Wrangler built-in transform |
| Forward/backward fill | Time-series data | Pandas in Glue/EMR job |
| KNN imputation | Values correlated with other features | Custom Spark/scikit-learn in EMR |
| Flag + impute | Missingness itself is informative | Add "is_missing" binary feature + impute |
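Three of the table's strategies can be sketched in a few lines of pandas (the DataFrame and column names are hypothetical examples, not a real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical customer dataset with missing values in three columns.
df = pd.DataFrame({
    "monthly_spend": [50.0, np.nan, 70.0, 65.0, np.nan],
    "plan_type": ["basic", "premium", np.nan, "basic", "basic"],
    "days_since_login": [3.0, 12.0, np.nan, 7.0, np.nan],
})

# Median imputation for a numerical feature (median is robust to outliers).
df["monthly_spend"] = df["monthly_spend"].fillna(df["monthly_spend"].median())

# Mode imputation for a categorical feature.
df["plan_type"] = df["plan_type"].fillna(df["plan_type"].mode()[0])

# Flag + impute: record that the value was missing BEFORE filling it in,
# since the missingness itself may be predictive.
df["days_since_login_missing"] = df["days_since_login"].isna().astype(int)
df["days_since_login"] = df["days_since_login"].fillna(df["days_since_login"].median())
```

The same transforms are available as built-ins in SageMaker Data Wrangler; the pandas version is what you would run inside a Glue or EMR job.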
Deduplication: Exact duplicates are straightforward—remove them. Fuzzy duplicates (same entity with slight variations like "Jon Smith" vs "John Smith") require record linkage techniques. AWS Glue FindMatches ML transform is specifically built for fuzzy deduplication at scale.
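To make the fuzzy-matching idea concrete, here is a toy record-linkage sketch using Python's standard-library `difflib` as a lightweight stand-in for what Glue FindMatches does at scale (the records, the 0.9 threshold, and the greedy strategy are all illustrative choices):

```python
from difflib import SequenceMatcher

records = ["Jon Smith", "John Smith", "Jane Doe", "JANE DOE ", "Alice Lee"]

def similarity(a: str, b: str) -> float:
    # Normalize case and whitespace before comparing, a common
    # record-linkage preprocessing step.
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

# Greedy deduplication: keep a record only if it is not near-identical
# (similarity >= 0.9) to any record already kept.
kept = []
for record in records:
    if all(similarity(record, k) < 0.9 for k in kept):
        kept.append(record)

print(kept)  # "John Smith" and "JANE DOE " are dropped as fuzzy duplicates
```

This pairwise approach is O(n²) and only suitable for small datasets; FindMatches exists precisely because comparing every pair does not scale, and it also learns which field differences matter from labeled examples rather than using a fixed string-similarity threshold.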
⚠️ Exam Trap: When a question describes a dataset with 30% missing values in a critical feature, "drop the rows" is almost always wrong—you'd lose too much data. Look for imputation strategies or flag-and-impute patterns. Conversely, when missingness is <1% and random, elaborate imputation is over-engineering.
Reflection Question: A dataset for predicting customer churn has a "last_login_date" column that's NULL for 15% of records. These NULLs all correspond to customers who never logged in after creating an account. Should you impute or is the missingness itself a feature?