3.1.2. Data Type Conversion and Normalization/Standardization
First Principle: Data type conversion and normalization/standardization fundamentally transform features into a consistent, algorithm-friendly numerical scale, optimizing convergence speed and preventing features with larger ranges from dominating model training.
Machine learning algorithms typically perform best when numerical features are on a similar scale. Data type conversion ensures data is in a numerical format, while normalization and standardization scale these numerical features.
Key Concepts:
- Data Type Conversion:
- Purpose: Convert data from one type to another (e.g., string to numerical, boolean to 0/1). Critical for categorical features (e.g., "Male", "Female") or textual features that must be consumed by numerical algorithms.
- Common Conversions: String to numeric (e.g., for categorical encoding), date/time to numerical (e.g., Unix timestamp, day of week), boolean to 0/1.
- Feature Scaling: Techniques to change the range or distribution of numerical features.
- Purpose: Prevent features with larger values from dominating the learning process (especially in distance-based algorithms such as SVMs and K-Means, and in gradient descent-based algorithms), and speed up the convergence of optimization algorithms.
- Types of Feature Scaling:
- Normalization (Min-Max Scaling):
- Method: Scales features to a fixed range, usually [0, 1] or [-1, 1].
- Formula:
X_normalized = (X - X_min) / (X_max - X_min)
- Use Cases: When the feature distribution is not Gaussian, or when the algorithm is sensitive to the exact range. Sensitive to outliers, because a single extreme value stretches the observed min-max range.
- Standardization (Z-score Normalization):
- Method: Scales features to have a mean of 0 and a standard deviation of 1.
- Formula:
X_standardized = (X - μ) / σ
(where μ is the mean and σ is the standard deviation).
- Use Cases: When the feature distribution is approximately Gaussian, or when the algorithm assumes a normal distribution. Less sensitive to outliers than Min-Max scaling.
- When to Apply: Typically applied after handling missing values and before training the model. The scaling parameters (min/max or mean/std dev) should be learned only from the training data to avoid data leakage (see the sketch after this list).
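As a concrete illustration of the concepts above, the sketch below uses pandas to convert a string, a boolean, and a date column to numeric features, then applies both scaling formulas using statistics computed from the training rows only. The column names and the tiny train/test split are hypothetical, chosen purely to show the mechanics.

```python
import pandas as pd

# Hypothetical raw data; column names are illustrative only.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male"],
    "is_member": [True, False, True, True],
    "signup_date": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-03-15", "2023-04-20"]),
    "income": [40000.0, 85000.0, 62000.0, 120000.0],
})

# Data type conversion: strings and booleans to numbers, dates to numeric features.
df["gender_code"] = (df["gender"] == "Female").astype(int)    # simple binary encoding
df["is_member"] = df["is_member"].astype(int)                  # boolean -> 0/1
df["signup_dayofweek"] = df["signup_date"].dt.dayofweek        # datetime -> day of week
df["signup_ts"] = df["signup_date"].astype("int64") // 10**9   # datetime -> Unix timestamp

# Feature scaling with statistics learned from the training split only (avoids leakage).
train, test = df.iloc[:3], df.iloc[3:]
x_min, x_max = train["income"].min(), train["income"].max()
mu, sigma = train["income"].mean(), train["income"].std()

test_minmax = (test["income"] - x_min) / (x_max - x_min)  # Min-Max: unseen values can fall outside [0, 1]
test_zscore = (test["income"] - mu) / sigma                # Z-score standardization
```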
AWS Tools:
- SageMaker Data Wrangler provides built-in Min-Max scaling, standard scaling, and various other numerical transformations.
- SageMaker Processing Jobs and AWS Glue ETL Jobs for custom implementations using libraries such as Scikit-learn's MinMaxScaler or StandardScaler.
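For custom jobs, the same scaling is commonly expressed with Scikit-learn. The snippet below is a minimal, self-contained sketch with made-up numbers (not a complete SageMaker Processing or Glue script): each scaler is fit on the training matrix only and then applied to both splits.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical training and test matrices (rows = examples, columns = features).
X_train = np.array([[21.0, 40000.0], [35.0, 85000.0], [52.0, 62000.0]])
X_test = np.array([[44.0, 120000.0]])

# Min-Max scaling: fit on training data, then apply the learned min/max everywhere.
minmax = MinMaxScaler(feature_range=(0, 1))
X_train_mm = minmax.fit_transform(X_train)
X_test_mm = minmax.transform(X_test)

# Z-score standardization: fit on training data, then apply the learned mean/std everywhere.
standard = StandardScaler()
X_train_std = standard.fit_transform(X_train)
X_test_std = standard.transform(X_test)
```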
Scenario: You are training a linear regression model to predict housing prices using features like "area_sq_ft" (ranging from 500 to 5000) and "num_bedrooms" (ranging from 1 to 5). You want to ensure these features contribute equally to the model and that the optimization algorithm converges efficiently.
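One possible treatment of this scenario (a sketch with hypothetical training rows and prices, not a prescribed solution) is to standardize both features inside a Scikit-learn Pipeline, so the scaler's statistics are learned from the same training data the regressor sees:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical training data: [area_sq_ft, num_bedrooms] -> price.
X_train = np.array([[500, 1], [1200, 2], [2500, 3], [3800, 4], [5000, 5]], dtype=float)
y_train = np.array([150_000, 260_000, 420_000, 610_000, 780_000], dtype=float)

# Standardizing puts area_sq_ft and num_bedrooms on comparable scales, so the
# gradient descent updates are not dominated by area_sq_ft's larger raw range.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0))
model.fit(X_train, y_train)

print(model.predict([[2000, 3]]))  # predicted price for an unseen house
```

Because the scaler sits inside the pipeline, the same training-only statistics are reused automatically at prediction time, which mirrors the data-leakage guidance above.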
Reflection Question: How do data type conversion and normalization/standardization (e.g., Min-Max scaling, Z-score standardization) fundamentally transform features into a consistent, algorithm-friendly numerical scale, optimizing convergence speed and preventing features with larger ranges from dominating model training?