
3.4.1. Sampling Techniques (Oversampling, Undersampling, SMOTE)

First Principle: Sampling techniques fundamentally address class imbalance in datasets by adjusting the number of samples in minority or majority classes, enabling models to learn more effectively from rare events and improve predictive performance for all classes.

Class imbalance is a common problem in classification tasks in which the number of observations in one class (the majority class) is significantly higher than in the other class or classes (the minority class). Models trained on such data tend to perform well on the majority class but poorly on the minority class, which is often the class of interest (e.g., fraud, disease). Sampling techniques rebalance the dataset so the model sees a more even class distribution during training.

Key Sampling Techniques (illustrated in the code sketch after this list):
  • Undersampling:
    • Method: Reduces the number of majority-class samples, typically until the class distribution is balanced or much less skewed.
    • Types: Random undersampling, Tomek links, Edited Nearest Neighbors (ENN).
    • Pros: Can help balance the dataset and reduce training time.
    • Cons: Risk of discarding potentially valuable information from the majority class, which can lead to underfitting.
  • Oversampling:
    • Method: Increases the number of samples in the minority class to match the number of samples in the majority class.
    • Types: Random oversampling (duplicating existing minority samples).
    • Pros: No loss of information from the majority class.
    • Cons: Can lead to overfitting, as the model might learn to classify the duplicated samples too well.
  • SMOTE (Synthetic Minority Over-sampling Technique):
    • Method: A more sophisticated oversampling technique that generates synthetic samples for the minority class rather than simply duplicating existing ones. For each selected minority sample, it creates new samples by interpolating along the line segments joining that sample to some of its k nearest minority-class neighbors.
    • Pros: Reduces the risk of overfitting compared to simple oversampling, as it creates new, distinct samples.
    • Cons: Can create noisy samples if the minority class is very sparse and can increase overlap between classes.
    • Variants: Borderline-SMOTE, ADASYN.
  • Combined Approaches: Often, a combination of undersampling and oversampling (e.g., SMOTE followed by Tomek links) is used to achieve better balance and performance.
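
Below is a minimal sketch of the techniques above using the open-source imbalanced-learn (imblearn) library together with scikit-learn; the synthetic dataset, class ratio, and random seeds are illustrative assumptions.

```python
# Minimal sketch: each sampler rebalances the same synthetic, imbalanced dataset.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Synthetic binary dataset with roughly a 9:1 majority:minority ratio (illustrative).
X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.9, 0.1], random_state=42
)
print("Original:     ", Counter(y))

# Undersampling: randomly drop majority-class rows until the classes are balanced.
X_u, y_u = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("Undersampled: ", Counter(y_u))

# Oversampling: randomly duplicate minority-class rows until the classes are balanced.
X_o, y_o = RandomOverSampler(random_state=42).fit_resample(X, y)
print("Oversampled:  ", Counter(y_o))

# SMOTE: interpolate new synthetic minority samples between nearest minority neighbors.
X_s, y_s = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("SMOTE:        ", Counter(y_s))

# Combined approach: SMOTE oversampling followed by Tomek-link cleaning.
X_c, y_c = SMOTETomek(random_state=42).fit_resample(X, y)
print("SMOTE + Tomek:", Counter(y_c))
```
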
Impact on ML:
  • Improved model performance on the minority class.
  • Better scores on imbalance-aware evaluation metrics such as recall and F1-score (accuracy alone is misleading on imbalanced data).
  • Reduced bias towards the majority class.
AWS Tools:
  • SageMaker Data Wrangler offers built-in transformations for oversampling and undersampling.
  • SageMaker Processing Jobs can run custom Python/Spark scripts using libraries such as imbalanced-learn (imblearn), which provides SMOTE and other advanced techniques, at scale.
  • XGBoost on SageMaker supports the scale_pos_weight parameter to handle class imbalance at the algorithm level, which is often preferred over explicit sampling if the algorithm supports it.
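
For the algorithm-level alternative, the sketch below uses the open-source xgboost package; the data are synthetic and the negative-to-positive ratio heuristic for scale_pos_weight is a common convention rather than a prescribed value. The same scale_pos_weight hyperparameter can be passed to the SageMaker built-in XGBoost algorithm.

```python
# Minimal sketch: weight the rare positive class at the algorithm level
# instead of resampling the data.
import numpy as np
from xgboost import XGBClassifier

# Illustrative imbalanced training data: 990 negatives, 10 positives.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 5))
y_train = np.array([0] * 990 + [1] * 10)

# Common heuristic: negative count divided by positive count (here 99.0).
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

model = XGBClassifier(scale_pos_weight=scale_pos_weight, eval_metric="logloss")
model.fit(X_train, y_train)
```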

Scenario: You are working on a medical diagnosis model where the "positive" diagnosis class is very rare (e.g., 1% of the dataset). Training a model directly results in it always predicting "negative" to achieve high accuracy. You need to improve the model's ability to correctly identify positive cases.
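
The sketch below reproduces this accuracy trap on synthetic data (an illustrative assumption): a baseline that always predicts the majority class looks highly accurate yet recalls none of the positive cases.

```python
# Minimal sketch: high accuracy, zero recall on a rare positive class.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 1% positive ("disease") cases.
X, y = make_classification(
    n_samples=20_000, n_features=10, weights=[0.99, 0.01], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# A baseline that always predicts the majority ("negative") class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
pred = baseline.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))  # close to 0.99
print("Recall:  ", recall_score(y_test, pred))    # 0.0 -- every positive case is missed
```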

Reflection Question: How do sampling techniques such as undersampling, oversampling, and especially SMOTE fundamentally address class imbalance by adjusting the sample distribution, enabling models to learn more effectively from rare events and improving overall predictive performance?

šŸ’” Tip: When using sampling techniques, apply them only to the training data. Do not apply them to the validation or test sets, as this would lead to an unrealistic evaluation of the model's performance on real-world imbalanced data.
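
A minimal sketch of that workflow, assuming scikit-learn and imbalanced-learn; the dataset, split size, and model choice are illustrative:

```python
# Minimal sketch: split first, resample only the training fold,
# evaluate on the untouched (still imbalanced) test set.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=10_000, n_features=20, weights=[0.95, 0.05], random_state=0
)

# 1) Split BEFORE any resampling so the test data keep the real-world imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# 2) Apply SMOTE to the training split only.
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)

# 3) Train on the resampled data and evaluate on the original test set.
model = LogisticRegression(max_iter=1000).fit(X_train_res, y_train_res)
print(classification_report(y_test, model.predict(X_test)))
```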