3.1.2. Data Type Conversion and Normalization/Standardization
First Principle: Data type conversion and normalization/standardization fundamentally transform features into a consistent, algorithm-friendly numerical scale, optimizing convergence speed and preventing features with larger ranges from dominating model training.
Machine learning algorithms typically perform best when numerical features are on a similar scale. Data type conversion ensures data is in a numerical format, while normalization and standardization scale these numerical features.
Key Concepts:
- Data Type Conversion:
- Purpose: Convert data from one type to another (e.g., string to numerical, boolean to 0/1). Critical for categorical features (e.g., "Male", "Female") or textual features that must be consumed by numerical algorithms.
- Common Conversions: String to numeric (e.g., for categorical encoding), date/time to numerical (e.g., Unix timestamp, day of week), boolean to 0/1.
- Feature Scaling: Techniques to change the range or distribution of numerical features.
- Purpose: Prevent features with larger values from dominating the learning process (especially in distance-based algorithms such as SVMs and K-Means, and in gradient descent-based algorithms), and speed up the convergence of optimization algorithms.
- Types of Feature Scaling:
- Normalization (Min-Max Scaling):
- Method: Scales features to a fixed range, usually [0, 1] or [-1, 1].
- Formula:
X_normalized = (X - X_min) / (X_max - X_min)
- Use Cases: When the feature distribution is not Gaussian, or when the algorithm is sensitive to the exact range. Sensitive to outliers, because a single extreme value stretches the observed min-max range.
- Standardization (Z-score Normalization):
- Method: Scales features to have a mean of 0 and a standard deviation of 1.
- Formula:
X_standardized = (X - μ) / σ
(where μ is the mean and σ is the standard deviation).
- Use Cases: When the feature distribution is approximately Gaussian, or when the algorithm assumes a normal distribution. Less sensitive to outliers than Min-Max scaling.
- When to Apply: Typically applied after handling missing values and before training the model. The scaling parameters (min/max or mean/std dev) should be learned only from the training data to avoid data leakage (see the sketch after this list).
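As a concrete illustration of the concepts above, the sketch below uses pandas to convert a string, a boolean, and a date column to numeric features, then applies both scaling formulas using statistics computed from the training rows only. The column names and the tiny train/test split are hypothetical, chosen purely to show the mechanics.

```python
import pandas as pd

# Hypothetical raw data; column names are illustrative only.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "is_member": [True, False, True, True],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-15", "2023-04-20"]),
    "income": [40000.0, 85000.0, 62000.0, 120000.0],
})

# Data type conversion: strings and booleans to numbers, dates to numeric features.
df["gender_code"] = (df["gender"] == "Female").astype(int)    # simple binary encoding
df["is_member"] = df["is_member"].astype(int)                  # boolean -> 0/1
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek        # datetime -> day of week
df["signup_ts"] = df["signup_date"].astype("int64") // 10**9   # datetime -> Unix timestamp

# Feature scaling with statistics learned from the training split only (avoids leakage).
train, test = df.iloc[:3], df.iloc[3:]
x_min, x_max = train["income"].min(), train["income"].max()
mu, sigma = train["income"].mean(), train["income"].std()

test_minmax = (test["income"] - x_min) / (x_max - x_min)  # Min-Max: unseen values can fall outside [0, 1]
test_zscore = (test["income"] - mu) / sigma                # Z-score standardization
```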
AWS Tools:
- SageMaker Data Wrangler provides built-in Min-Max scaling, standard scaling, and various other numerical transformations.
- SageMaker Processing Jobs and AWS Glue ETL Jobs for custom implementations using libraries such as Scikit-learn's MinMaxScaler or StandardScaler.
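For custom jobs, the same scaling is commonly expressed with Scikit-learn. The snippet below is a minimal, self-contained sketch with made-up numbers (not a complete SageMaker Processing or Glue script): each scaler is fit on the training matrix only and then applied to both splits.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical training and test matrices (rows = examples, columns = features).
X_train = np.array([[21.0, 40000.0], [35.0, 85000.0], [52.0, 62000.0]])
X_test = np.array([[44.0, 120000.0]])

# Min-Max scaling: fit on training data, then apply the learned min/max everywhere.
minmax = MinMaxScaler(feature_range=(0, 1))
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)

# Z-score standardization: fit on training data, then apply the learned mean/std everywhere.
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)
```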
Scenario: You are training a linear regression model to predict housing prices using features like "area_sq_ft" (ranging from 500 to 5000) and "num_bedrooms" (ranging from 1 to 5). You want to ensure these features contribute equally to the model and that the optimization algorithm converges efficiently.
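One possible treatment of this scenario (a sketch with hypothetical training rows and prices, not a prescribed solution) is to standardize both features inside a Scikit-learn Pipeline, so the scaler's statistics are learned from the same training data the regressor sees:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: [area_sq_ft, num_bedrooms] -> price.
X_train = np.array([[500, 1], [1200, 2], [2500, 3], [3800, 4], [5000, 5]], dtype=float)
y_train = np.array([150_000, 260_000, 420_000, 610_000, 780_000], dtype=float)

# Standardizing puts area_sq_ft and num_bedrooms on comparable scales, so the
# gradient descent updates are not dominated by area_sq_ft's larger raw range.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0))
model.fit(X_train, y_train)

print(model.predict([[2000, 3]]))  # predicted price for an unseen house
```

Because the scaler sits inside the pipeline, the same training-only statistics are reused automatically at prediction time, which mirrors the data-leakage guidance above.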
Reflection Question: How do data type conversion and normalization/standardization (e.g., Min-Max scaling, Z-score standardization) fundamentally transform features into a consistent, algorithm-friendly numerical scale, optimizing convergence speed and preventing features with larger ranges from dominating model training?