3.3.2. Numerical Feature Transformations (Log, Polynomial, Binning)
First Principle: Numerical feature transformations fundamentally reshape the distribution and relationships of numerical data: they mitigate skewness, let algorithms capture non-linear patterns, and help the data satisfy linearity or normality assumptions.
Beyond simple scaling, numerical features often benefit from transformations to better fit algorithm assumptions (e.g., normality, linearity) or to capture more complex relationships.
Key Numerical Feature Transformation Techniques:
- Log Transformation:
  - Method: Apply the natural logarithm (ln) or base-10 logarithm to a feature (e.g., `log(X)`, or `log(1 + X)` for features that can be zero); a minimal sketch follows this item.
  - Use Cases:
    - Reduce Skewness: Useful for highly skewed positive distributions (e.g., income, house size, transaction amounts).
    - Stabilize Variance: Can make variance more consistent across different ranges of the feature.
    - Handle Exponential Relationships: Convert exponential relationships into linear ones.
  - Limitation: Only applicable to positive values; `log(1 + X)` extends this to zero, but negative values still require a shift or a different transform.
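A minimal sketch of the log transform with NumPy; the `income` data here is synthetic and illustrative:

```python
import numpy as np

# Synthetic right-skewed, non-negative feature (e.g., annual income).
rng = np.random.default_rng(seed=42)
income = rng.lognormal(mean=10.5, sigma=0.8, size=1_000)

# np.log1p computes log(1 + X): defined at X = 0 and more numerically
# accurate than np.log(1 + X) for values near zero.
log_income = np.log1p(income)

# The mean sits far above the median before the transform; afterwards
# the two are much closer, indicating reduced skew.
print(f"before: mean={income.mean():,.0f}  median={np.median(income):,.0f}")
print(f"after:  mean={log_income.mean():.2f}  median={np.median(log_income):.2f}")
```

`np.expm1` inverts the transform when predictions need to be mapped back to the original scale.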
- Polynomial Features:
  - Method: Create new features by raising existing features to a power (e.g., `X^2`, `X^3`) or creating interaction terms (e.g., `X1 * X2`); see the sketch after this item.
  - Use Cases: Capture non-linear relationships between features and the target variable, or interactions between features.
  - Limitation: Can lead to a high number of features (high dimensionality) and potential overfitting if not carefully managed.
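A minimal sketch using Scikit-learn's `PolynomialFeatures`; the two-column toy matrix is illustrative:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy features, X1 and X2, as columns.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

# degree=2 adds squared terms plus the X1*X2 interaction;
# include_bias=False drops the constant "1" column.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

print(poly.get_feature_names_out(["X1", "X2"]))
# ['X1' 'X2' 'X1^2' 'X1 X2' 'X2^2']
print(X_poly)
```

Setting `interaction_only=True` keeps only the cross terms, one way to curb the dimensionality blow-up noted in the limitation above.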
- Binning (Discretization):
  - Method: Convert a continuous numerical feature into a set of discrete bins or intervals.
  - Example: Age (continuous) -> Age Group (e.g., "18-25", "26-35", "36+").
  - Types:
    - Equal-width binning: Bins have the same width.
    - Equal-frequency (quantile) binning: Each bin has approximately the same number of data points.
  - Use Cases: Handle outliers, reduce noise, make non-linear relationships easier for linear models to capture (each bin gets its own coefficient after encoding), and provide categorical inputs for algorithms that prefer them.
  - Limitation: Loss of information due to discretization. A sketch contrasting the two binning types follows this item.
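A minimal sketch of both binning strategies with Scikit-learn's `KBinsDiscretizer`; the `age` values are illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

age = np.array([[18], [22], [25], [31], [36], [44], [52], [67]])

# strategy="uniform"  -> equal-width bins (same width per bin)
# strategy="quantile" -> equal-frequency bins (similar count per bin)
for strategy in ("uniform", "quantile"):
    kbd = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    labels = kbd.fit_transform(age)
    print(strategy, labels.ravel(), "edges:", kbd.bin_edges_[0].round(1))
```

Using `encode="onehot-dense"` instead emits one indicator column per bin, which is the form a downstream linear model typically consumes.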
- Other Transformations: Square root, cube root, reciprocal, and power transforms (e.g., Box-Cox for strictly positive data, Yeo-Johnson for data that may include zeros or negatives); a sketch follows.
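A minimal sketch of a power transform with Scikit-learn's `PowerTransformer`; the exponential sample is illustrative:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(seed=0)
X = rng.exponential(scale=2.0, size=(500, 1))  # right-skewed, strictly positive

# method="box-cox" requires strictly positive inputs;
# method="yeo-johnson" (the default) also accepts zeros and negatives.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_transformed = pt.fit_transform(X)

print("fitted lambda:", pt.lambdas_.round(3))  # estimated per-feature exponent
```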
AWS Tools:
- SageMaker Data Wrangler offers built-in transformations for log transformation, polynomial features, and binning.
- SageMaker Processing Jobs or Glue ETL Jobs for custom implementations using libraries like Scikit-learn's `PolynomialFeatures` or custom Python/Spark code.
Scenario: You are building a regression model to predict customer spending, where the "income" feature is heavily right-skewed. You also suspect that the relationship between "age" and "spending" might be non-linear, and that grouping customer ages into "age_bins" might simplify the model.
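A sketch of how this scenario could be wired up end to end, assuming a pandas DataFrame with illustrative `income`, `age`, and `spending` columns (all names and data here are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, KBinsDiscretizer

# Synthetic stand-in for the real customer table.
rng = np.random.default_rng(seed=7)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10.5, sigma=0.9, size=500),  # right-skewed
    "age": rng.integers(18, 80, size=500),
})
df["spending"] = 0.002 * df["income"] + 5 * np.sqrt(df["age"]) + rng.normal(0, 10, 500)

preprocess = ColumnTransformer([
    # log(1 + income) tames the right skew
    ("log_income", FunctionTransformer(np.log1p), ["income"]),
    # one-hot encoded quantile bins let a linear model fit a step-wise age effect
    ("age_bins", KBinsDiscretizer(n_bins=4, encode="onehot-dense",
                                  strategy="quantile"), ["age"]),
])

model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(df[["income", "age"]], df["spending"])
print("R^2 on training data:", round(model.score(df[["income", "age"]], df["spending"]), 3))
```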
Reflection Question: How do numerical feature transformations (e.g., log transformation for skewed data, polynomial features for non-linear relationships, binning for discretization) fundamentally reshape the distribution and relationships of numerical data, enabling algorithms to better capture non-linear patterns and helping the data satisfy linearity assumptions?