4.1.1. Regression Algorithms (Linear, Logistic, XGBoost)

First Principle: Regression algorithms learn to predict a continuous numerical output by modeling the relationship between input features and known target values, enabling quantitative forecasts.

Regression algorithms are supervised learning methods used when the target variable is a continuous numerical value (e.g., house price, temperature, sales revenue, number of returns).

Key Regression Algorithms and their characteristics:
  • Linear Regression:
    • What it is: Models a linear relationship between input features (independent variables) and the continuous target (dependent variable). It finds the "best fit" line or hyperplane (see the first sketch after this list).
    • Assumptions: Linear relationship between features and target, independent observations, normally distributed residuals, homoscedasticity (constant variance of residuals).
    • AWS: SageMaker Linear Learner (supports linear regression on dense or sparse data).
  • Logistic Regression:
    • What it is: Despite "regression" in its name, it's primarily used for binary classification. It models the probability of a binary outcome using a logistic function.
    • Output: Probabilities (0 to 1), which can be converted to classes by applying a threshold (see the second sketch after this list).
    • AWS: SageMaker Linear Learner can be configured for binary or multi-class classification.
  • XGBoost (Extreme Gradient Boosting):
    • What it is: A powerful, highly optimized, and scalable open-source implementation of the gradient boosting framework. It builds an ensemble of decision trees sequentially, each new tree correcting the errors of the ensemble so far (see the third sketch after this list).
    • Strengths: Handles various data types, robust to outliers, handles missing values, high performance, capable of capturing complex non-linear relationships.
    • Parameters: Many hyperparameters to tune (e.g., num_round, eta, max_depth, subsample, colsample_bytree).
    • AWS: SageMaker XGBoost is a built-in algorithm, highly optimized for large datasets and distributed training.
  • Metrics for Regression: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (all four are computed in the final sketch after this list).
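
To make the linear case concrete, here is a minimal sketch of linear regression with scikit-learn. The data is synthetic: the true slope (3.0), intercept (5.0), and noise level are assumptions invented for illustration.

```python
# Minimal linear regression sketch (scikit-learn); the data is synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))                # one input feature
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1.0, 100)  # assumed linear signal + noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # fitted slope/intercept, near 3.0 and 5.0
print(model.predict([[4.0]]))         # quantitative forecast for a new input
```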
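
The logistic case differs only in the output: the model emits a probability, which a threshold turns into a class. A minimal sketch follows, again on synthetic data; the feature weights are invented, and 0.5 is the conventional default threshold.

```python
# Minimal logistic regression sketch for binary classification; data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Labels come from an assumed linear score pushed through a noisy cutoff.
y = (X @ np.array([1.5, -2.0]) + rng.normal(0, 0.5, 200) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X[:5])[:, 1]  # P(class = 1), always in [0, 1]
labels = (proba >= 0.5).astype(int)     # classes via the conventional 0.5 threshold
print(proba, labels)
```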
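
For XGBoost, here is a minimal regression sketch with the open-source xgboost package that exercises the hyperparameters named above. The values shown are illustrative starting points, not tuned settings, and the non-linear target is invented for the example.

```python
# Minimal XGBoost regression sketch; hyperparameter values are illustrative.
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, 500)  # assumed non-linear target

params = {
    "objective": "reg:squarederror",
    "eta": 0.1,               # learning rate
    "max_depth": 4,           # depth of each tree
    "subsample": 0.8,         # row sampling per boosting round
    "colsample_bytree": 0.8,  # feature sampling per tree
}
booster = xgb.train(params, xgb.DMatrix(X, label=y), num_boost_round=100)  # num_round
print(booster.predict(xgb.DMatrix(X[:5])))  # ensemble predictions
```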
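
Finally, the four metrics from the list, computed with scikit-learn on a toy pair of target/prediction arrays (the numbers are invented):

```python
# Minimal sketch of the four regression metrics; the arrays are toy values.
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.5])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)                        # RMSE = sqrt(MSE), in the target's units
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)              # fraction of variance explained
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}  R^2={r2:.3f}")
```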

Scenario: You need to build a model to predict the expected demand for a new product launch (a continuous numerical value). The marketing team also wants to predict the probability of a customer clicking on an ad campaign (a binary outcome).

Reflection Question: Given the same supervised setup of input features paired with known target values, how does Linear Regression (or XGBoost) learn to output a continuous numerical prediction, while Logistic Regression learns to output a probability for a binary outcome?