4.1.1. Regression Algorithms (Linear, Logistic, XGBoost)
First Principle: Regression algorithms fundamentally learn to predict a continuous numerical output by modeling the relationships between input features and known target values, enabling quantitative forecasts.
Regression algorithms are a subset of supervised learning used when the target variable is a continuous numerical value (e.g., house price, temperature, sales revenue, number of returns).
Key Regression Algorithms and their characteristics (a combined code sketch follows this list):
- Linear Regression:
- What it is: Models a linear relationship between input features (independent variables) and the continuous target (dependent variable). It finds the "best fit" line or hyperplane.
- Assumptions: Linear relationship, independent observations, normality of residuals, homoscedasticity.
- AWS: SageMaker Linear Learner (can do linear regression for dense or sparse data).
- Logistic Regression:
- What it is: Despite "regression" in its name, it is primarily used for binary classification. It models the probability of a binary outcome by passing a linear combination of the input features through the logistic (sigmoid) function.
- Output: Probabilities (0 to 1), which can be converted to classes using a threshold.
- AWS: SageMaker Linear Learner can be configured for binary or multi-class classification.
- XGBoost (Extreme Gradient Boosting):
- What it is: A powerful, highly optimized, and scalable open-source implementation of the gradient boosting framework. It builds an ensemble of decision trees sequentially.
- Strengths: Handles various data types, is relatively robust to outliers, handles missing values natively, delivers high performance, and captures complex non-linear relationships.
- Parameters: Many hyperparameters to tune (e.g., `num_round`, `eta`, `max_depth`, `subsample`, `colsample_bytree`).
- AWS: SageMaker XGBoost is a built-in algorithm, highly optimized for large datasets and distributed training.
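The following minimal sketch shows all three algorithms side by side on synthetic data, using scikit-learn and the open-source xgboost package locally rather than the SageMaker built-ins; the data, feature weights, and hyperparameter values are illustrative assumptions, not a prescribed configuration.

```python
# Minimal sketch (assumptions: synthetic data, illustrative hyperparameters).
# Uses scikit-learn and open-source xgboost; the SageMaker built-ins expose
# the same concepts as managed training jobs.
import numpy as np
import xgboost as xgb
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))  # synthetic features
y_cont = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.5, size=1000)
y_bin = (y_cont > y_cont.mean()).astype(int)  # synthetic binary target

X_tr, X_te, yc_tr, yc_te, yb_tr, yb_te = train_test_split(
    X, y_cont, y_bin, test_size=0.2, random_state=0
)

# 1) Linear Regression: fits a hyperplane that minimizes squared error.
lin = LinearRegression().fit(X_tr, yc_tr)
print("linear R^2:", lin.score(X_te, yc_te))

# 2) Logistic Regression: linear model + sigmoid -> probability in [0, 1],
#    converted to class labels with a threshold.
log = LogisticRegression().fit(X_tr, yb_tr)
proba = log.predict_proba(X_te)[:, 1]  # P(class = 1)
pred = (proba >= 0.5).astype(int)      # threshold at 0.5

# 3) XGBoost regression: a sequentially built ensemble of boosted trees.
#    Parameter names mirror the SageMaker built-in (eta, max_depth, ...).
params = {"eta": 0.1, "max_depth": 4, "subsample": 0.8,
          "colsample_bytree": 0.8, "objective": "reg:squarederror"}
dtrain = xgb.DMatrix(X_tr, label=yc_tr)
dtest = xgb.DMatrix(X_te, label=yc_te)
booster = xgb.train(params, dtrain, num_boost_round=100)  # num_round in SageMaker
xgb_pred = booster.predict(dtest)
```

Note the design contrast the sketch makes concrete: the two linear models learn a single weight vector, while XGBoost adds trees one at a time, each correcting the residual errors of the ensemble so far.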
- Metrics for Regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared.
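As a quick sketch, these metrics can all be computed with scikit-learn; the `y_true`/`y_pred` arrays below are illustrative placeholders, not real model output.

```python
# Standard regression metrics on placeholder predictions.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # illustrative ground truth
y_pred = np.array([2.8, 5.4, 2.9, 6.6])  # illustrative predictions

mse = mean_squared_error(y_true, y_pred)    # average squared error
rmse = np.sqrt(mse)                         # same units as the target
mae = mean_absolute_error(y_true, y_pred)   # average absolute error
r2 = r2_score(y_true, y_pred)               # fraction of variance explained
print(f"MSE={mse:.3f} RMSE={rmse:.3f} MAE={mae:.3f} R^2={r2:.3f}")
```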
Scenario: You need to build a model to predict the expected demand for a new product launch (a continuous numerical value). The marketing team also wants to predict the probability of a customer clicking on an ad campaign (a binary outcome).
Reflection Question: How do regression algorithms like Linear Regression or XGBoost learn to predict a continuous numerical output, and how does Logistic Regression instead learn to output a probability for a binary outcome, by modeling the relationships between input features and known target values?