3.3. Feature Engineering Techniques
First Principle: Feature engineering transforms raw data into a richer, more informative set of features, which lets ML algorithms learn more effectively and improves model performance.
Feature engineering is the process of using domain knowledge to extract new features from raw data that make machine learning algorithms work better. It is often considered one of the most critical steps in the ML workflow.
Key Concepts of Feature Engineering:
- Purpose: Create new features that represent underlying patterns more effectively, improve model accuracy, and enable algorithms to learn from the data more efficiently.
- Iterative Process: Candidate features are proposed, evaluated against model performance, and refined, which usually takes both experimentation and domain expertise.
- Types of Transformations:
  - Categorical: Encoding categorical variables into numerical format.
  - Numerical: Binning, polynomial features, interactions.
  - Text: TF-IDF, word embeddings.
  - Date/Time: Extracting day of week, month, year, time differences.
  - Aggregations: Creating summary statistics (min, max, average, count) from related data.
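To make the categorical, numerical, and text transformations above concrete, here is a minimal sketch using only the Python standard library. The helper names (`one_hot`, `bin_value`, `polynomial_and_interaction`, `tf_idf`) and the toy data are hypothetical and chosen for illustration; in practice a library such as pandas or scikit-learn would typically handle these steps.

```python
import math
from collections import Counter


def one_hot(values):
    """Encode each categorical value as a 0/1 vector over the sorted categories."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded


def bin_value(x, edges):
    """Return the index of the bin x falls into, given ascending bin edges."""
    return sum(x >= edge for edge in edges)


def polynomial_and_interaction(x, y):
    """Derive degree-2 polynomial terms and the x*y interaction term."""
    return {"x": x, "y": y, "x^2": x * x, "y^2": y * y, "x*y": x * y}


def tf_idf(docs):
    """Plain TF-IDF: term frequency times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append(
            {t: (c / len(tokens)) * math.log(n_docs / doc_freq[t])
             for t, c in counts.items()}
        )
    return scores


# Hypothetical toy data (not from any real dataset)
devices = ["mobile", "desktop", "mobile", "tablet"]
ages = [17, 34, 52, 70]

categories, encoded = one_hot(devices)
age_bins = [bin_value(a, edges=[18, 35, 65]) for a in ages]  # [0, 1, 2, 3]
poly = polynomial_and_interaction(2.0, 3.0)
scores = tf_idf(["red shoes", "blue shoes"])
```

Note that "shoes" appears in every document, so its TF-IDF score is zero: the weighting deliberately suppresses terms that do not help distinguish one document from another.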
AWS Tools for Feature Engineering:
- Amazon SageMaker Data Wrangler: Provides a visual interface with a wide array of built-in transformations for various feature engineering tasks (e.g., one-hot encoding, binning, text processing).
- Amazon SageMaker Processing Jobs: Ideal for running large-scale, custom feature engineering scripts using Apache Spark or scikit-learn in a managed environment.
- Amazon SageMaker Feature Store: A centralized repository for storing and serving curated features for both training and inference. Sharing one set of feature definitions keeps the two consistent (reducing training-serving skew), and point-in-time queries against the offline store can help avoid data leakage when building training sets.
- AWS Glue ETL Jobs: For batch feature engineering as part of a larger data pipeline, especially for data stored in S3 or Redshift.
- SageMaker Notebook Instances / Studio Notebooks: For interactive experimentation with feature engineering on smaller datasets.
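As a rough illustration of what a custom feature engineering script for a batch job might contain, the sketch below reads CSV files from an input directory and writes them to an output directory with one added column. The function name `add_query_length` and the `search_query` column are assumptions made for this example, and the code does not call any AWS API. In a SageMaker Processing job, the directories would be whatever local paths the job's inputs and outputs are configured to use (commonly under `/opt/ml/processing/`).

```python
import csv
from pathlib import Path


def add_query_length(input_dir, output_dir):
    """Copy each CSV from input_dir to output_dir, adding a query_length column.

    Assumes (hypothetically) that each file has a 'search_query' column;
    missing or empty queries get a length of 0.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).glob("*.csv")):
        with src.open(newline="") as f:
            reader = csv.DictReader(f)
            rows = list(reader)
            fieldnames = list(reader.fieldnames or []) + ["query_length"]
        for row in rows:
            row["query_length"] = len(row.get("search_query") or "")
        dst = out / src.name
        with dst.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
        written.append(dst)
    return written
```

Keeping the transformation in a plain function that takes directory paths makes it easy to test locally on a small sample before running it at scale in a managed job.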
Scenario: You are building a model to predict user engagement on a website. Your raw data includes timestamps of user visits, user IDs, and free-text search queries. You need to create new features such as "time of day," "day of week," "number of visits in the last 7 days," and "length of search query."
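The four features named in the scenario can be sketched in plain Python as follows. The event records, the time-of-day boundaries, and the names `build_features` and `time_of_day` are assumptions made for illustration, not part of any AWS service. "Number of visits in the last 7 days" is computed from previously seen visits only, so a row never uses information from its own future.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import datetime, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical raw events: (user_id, ISO timestamp, search query)
raw_events = [
    ("u1", "2024-03-01T08:15:00", "running shoes"),
    ("u1", "2024-03-05T21:40:00", "trail running shoes sale"),
    ("u2", "2024-03-06T13:05:00", ""),
    ("u1", "2024-03-10T09:00:00", "socks"),
]


def time_of_day(hour):
    """Bucket an hour (0-23) into a coarse label; the boundaries are arbitrary."""
    if 5 <= hour < 12:
        return "morning"
    if 12 <= hour < 17:
        return "afternoon"
    if 17 <= hour < 22:
        return "evening"
    return "night"


def build_features(events, window_days=7):
    """Derive temporal, aggregate, and text-length features per visit."""
    parsed = sorted(
        ((user, datetime.fromisoformat(ts), query) for user, ts, query in events),
        key=lambda event: event[1],
    )
    window = timedelta(days=window_days)
    history = defaultdict(list)  # user_id -> earlier visit times, ascending
    rows = []
    for user, ts, query in parsed:
        earlier = history[user]
        # Index of the first earlier visit that falls inside the window
        first_in_window = bisect_left(earlier, ts - window)
        rows.append({
            "user_id": user,
            "time_of_day": time_of_day(ts.hour),
            "day_of_week": DAY_NAMES[ts.weekday()],
            "visits_last_7d": len(earlier) - first_in_window,
            "query_length": len(query),
        })
        earlier.append(ts)
    return rows


features = build_features(raw_events)
for row in features:
    print(row)
```

For the last event (user `u1` on 2024-03-10), only the 2024-03-05 visit falls within the preceding 7 days, so `visits_last_7d` is 1 even though the user has two earlier visits overall.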
Reflection Question: How does transforming raw data into a richer set of features (for example, deriving temporal features from timestamps or text features from search queries, using a tool such as SageMaker Data Wrangler) enhance the learning capabilities of ML algorithms and improve model performance?