4.2.2. Dimensionality Reduction (PCA, t-SNE)
First Principle: Dimensionality reduction techniques fundamentally transform high-dimensional data into a lower-dimensional representation while preserving essential information, mitigating the "curse of dimensionality" and improving model performance, interpretability, and visualization.
High-dimensional datasets (those with many features) cause several problems in machine learning, collectively known as the "curse of dimensionality": data becomes sparse, distance measures lose meaning, and models overfit more easily. Dimensionality reduction transforms the data into a lower-dimensional space while retaining as much useful information as possible.
Key Concepts of Dimensionality Reduction:
- Purpose:
- Mitigate Curse of Dimensionality: Improve model performance, reduce overfitting.
- Reduce Training Time: Fewer features mean faster training.
- Reduce Storage: Smaller datasets.
- Improve Visualization: Easier to plot and understand data in 2D/3D.
- Noise Reduction: Can filter out irrelevant features.
- Feature Selection vs. Feature Extraction:
- Feature Selection: Selects a subset of original features.
- Feature Extraction: Creates new, lower-dimensional features derived from the original ones (see the short sketch after this list).
- Linear vs. Non-linear Techniques:
- Linear: Preserve linear relationships (e.g., PCA).
- Non-linear: Preserve non-linear relationships, better for complex data (e.g., t-SNE).
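To make the feature selection vs. feature extraction distinction concrete, here is a minimal sketch assuming scikit-learn; the synthetic dataset and the choice of k=5 are illustrative. Selection keeps a subset of the original columns (which stay interpretable), while extraction builds new features from all of them.

```python
# Minimal sketch contrasting feature selection and feature extraction
# (assumes scikit-learn; dataset and parameter choices are illustrative).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=42)

# Feature selection: keep 5 of the original 20 columns (features stay interpretable).
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Feature extraction: build 5 new components as combinations of all 20 columns.
X_extracted = PCA(n_components=5).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (500, 5) (500, 5)
```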
Key Dimensionality Reduction Techniques:
- Principal Component Analysis (PCA):
- What it is: A linear dimensionality reduction technique that transforms data into a new set of orthogonal (uncorrelated) variables called principal components, ordered so that each successive component captures the maximum remaining variance in the data (see the PCA sketch after this list).
- Strengths: Simple, effective for linear relationships, good for noise reduction.
- Limitations: Assumes linear relationships, components are often not easily interpretable, and it is sensitive to feature scaling (standardize features before applying it).
- AWS: SageMaker Principal Component Analysis (PCA) is a built-in algorithm.
- t-SNE (t-Distributed Stochastic Neighbor Embedding):
- What it is: A non-linear dimensionality reduction technique especially well-suited for visualizing high-dimensional data in 2 or 3 dimensions. It focuses on preserving local structure (points that are neighbors in the original space stay close in the embedding), whereas global distances and cluster sizes in the plot are not reliably meaningful; the scenario sketch at the end of this section shows it in use.
- Strengths: Excellent for visualization, can reveal intricate clusters.
- Limitations: Computationally expensive for large datasets, no direct mapping for new data, parameters (e.g., perplexity) can significantly affect results.
- Autoencoders: Neural network-based technique for learning a compressed representation.
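As an illustration of PCA in practice, here is a minimal sketch assuming scikit-learn and its bundled digits dataset; the 95% variance threshold is an illustrative choice, not a fixed rule. Features are standardized first because PCA is sensitive to scale.

```python
# Minimal PCA sketch (assumes scikit-learn; the 95%-variance threshold is illustrative).
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)          # 64 features per sample

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep enough orthogonal components to explain ~95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} features")
print(f"Variance explained: {pca.explained_variance_ratio_.sum():.2%}")
```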
Scenario: You have a high-dimensional dataset of customer demographics with over 100 features. You want to reduce the number of features to speed up model training and improve interpretability, while also being able to visualize customer clusters in a 2D plot.
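One common way to approach this scenario is to chain the two techniques: PCA to compress the 100+ features into a smaller set of components for faster training and less noise, then t-SNE on those components for a 2D cluster plot. The sketch below assumes scikit-learn and matplotlib, and uses a synthetic stand-in for the customer data; all hyperparameters (50 components, perplexity 30) are illustrative.

```python
# A sketch of one way to approach the scenario (assumes scikit-learn and matplotlib;
# the synthetic "customer" data and all hyperparameters are illustrative).
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Stand-in for a 100+ feature customer demographics table.
X, _ = make_blobs(n_samples=1000, n_features=120, centers=5, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Step 1: PCA to ~50 components for faster training and reduced noise.
X_pca = PCA(n_components=50, random_state=42).fit_transform(X_scaled)

# Step 2: t-SNE on the PCA output to get a 2D view of customer clusters.
X_2d = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca)

plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.title("t-SNE view of customer clusters (after PCA)")
plt.show()
```

Running t-SNE on the PCA output rather than on the raw features is a common practice: it cuts t-SNE's computational cost and filters some noise before the non-linear embedding is computed.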
Reflection Question: How do dimensionality reduction techniques like PCA (for linear transformations) and t-SNE (for non-linear visualization) fundamentally transform high-dimensional data into a lower-dimensional representation while preserving essential information, mitigating the "curse of dimensionality" and improving model performance and visualization?