3.1.1. Regression: Predicting Numbers
Regression predicts continuous numeric values. Think of it as drawing a line through data points to predict where future points will fall. The name comes from statistics: Francis Galton coined "regression" after observing that extreme values tend to "regress" toward the mean.
Key characteristics:
- Predicts numeric/continuous values
- Output is a number (price, temperature, probability value)
- Uses labeled training data (supervised learning)
- Measures error as the distance between the prediction and the actual value (how far off was the guess?)
How regression works:
- Training data provides examples: features (inputs) and labels (correct numeric outputs)
- The algorithm finds a mathematical function that best fits the data
- For new data, the function predicts the numeric output
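The three steps above can be sketched in a few lines. This is a minimal illustration with made-up house-price numbers (the data and the 1,800 sq ft query are assumptions, not real figures), fitting a straight line by least squares:

```python
import numpy as np

# Hypothetical training data: square footage (feature) and price (label).
sqft = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
price = np.array([200_000, 290_000, 410_000, 500_000, 610_000], dtype=float)

# Training: find the line price ≈ slope * sqft + intercept that best fits.
slope, intercept = np.polyfit(sqft, price, deg=1)

# Inference: the fitted function predicts the price of a new 1,800 sq ft house.
predicted = slope * 1800 + intercept
print(round(predicted))  # → 360800
```

The "mathematical function" here is just a line; more complex regression models follow the same train-then-predict pattern.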
Common scenarios:
- Predicting house prices based on square footage, bedrooms, location
- Forecasting sales numbers for next quarter
- Estimating delivery times based on distance and traffic
- Predicting patient blood pressure based on age, weight, exercise
The following table shows regression types:
| Regression Type | When to Use | Example |
|---|---|---|
| Linear Regression | One feature predicts one label | Square footage → house price |
| Multiple Linear Regression | Multiple features predict one label | Square footage + bedrooms + age → house price |
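Multiple linear regression from the table's second row can be sketched by adding an intercept column and solving a least-squares system. All numbers below are invented for illustration:

```python
import numpy as np

# Hypothetical data: square footage, bedrooms, age (years) → price.
X = np.array([
    [1500, 3, 20],
    [2000, 4, 15],
    [1200, 2, 30],
    [2500, 4, 5],
    [1800, 3, 10],
], dtype=float)
y = np.array([300_000, 420_000, 220_000, 520_000, 380_000], dtype=float)

# Prepend a column of ones so the model also learns an intercept term.
X1 = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem for one coefficient per feature.
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Predict the price of a 1,600 sq ft, 3-bedroom, 12-year-old house.
pred = float(np.array([1.0, 1600, 3, 12]) @ coef)
print(pred)
```

The only difference from simple linear regression is that the model learns one coefficient per feature instead of a single slope.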
Features vs. Labels in Regression:
- Features (inputs): The data points you know (square footage, number of bedrooms)
- Label (output): The numeric value you're predicting (price)
- Training: Model learns the relationship between features and labels
- Inference: Model predicts labels for new feature combinations
Example: Predicting ice cream sales
- Features: temperature, day of week, is_holiday
- Label: number of ice creams sold
- The model learns: "When temperature rises, sales increase"
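The ice cream example can be made concrete. With invented sales data (the numbers below are assumptions), fitting the model yields a positive coefficient on temperature, which is exactly the learned rule "when temperature rises, sales increase":

```python
import numpy as np

# Hypothetical data: temperature (°C), is_holiday (0/1) → ice creams sold.
X = np.array([
    [20, 0],
    [25, 0],
    [30, 1],
    [35, 0],
    [22, 1],
    [28, 0],
], dtype=float)
sales = np.array([110, 160, 260, 300, 170, 210], dtype=float)

# Fit multiple linear regression with an intercept column.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, sales, rcond=None)

# A positive temperature coefficient encodes "warmer days → more sales".
temp_coef = coef[1]
print(temp_coef > 0)  # → True
```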
Evaluating regression models:
| Metric | What It Measures | Interpretation |
|---|---|---|
| MAE (Mean Absolute Error) | Average error magnitude | Lower is better |
| RMSE (Root Mean Square Error) | Error with outlier penalty | Lower is better |
| R² (R-squared) | Variance explained | 0-1, higher is better |
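All three metrics in the table can be computed by hand, which makes their definitions easy to remember. A minimal sketch with made-up predictions:

```python
import math

# Hypothetical true labels and model predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.0, 8.0, 8.5]

n = len(y_true)
errors = [t - p for t, p in zip(y_true, y_pred)]

# MAE: average magnitude of the errors.
mae = sum(abs(e) for e in errors) / n

# RMSE: errors are squared first, so large misses (outliers) cost more.
rmse = math.sqrt(sum(e * e for e in errors) / n)

# R²: fraction of the label's variance that the model explains.
mean_y = sum(y_true) / n
ss_res = sum(e * e for e in errors)
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, round(rmse, 4), round(r2, 4))  # → 0.5 0.6124 0.925
```

Note how RMSE (0.61) exceeds MAE (0.5) here: the single 1.0-unit miss is penalized more heavily once squared.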
When regression fails:
- Non-linear relationships (try polynomial regression)
- Missing important features
- Outliers skewing predictions
- Correlated features (multicollinearity)
Regression vs. Classification decision:
- "How much will this cost?" → Regression (numeric)
- "Will this customer churn?" → Classification (yes/no)
- "What's the probability of rain?" → Could be either!
- If you need the probability NUMBER (73%) → Regression
- If you need the CATEGORY (will rain/won't rain) → Classification
⚠️ Critical Exam Trap: For multiple linear regression, features must be INDEPENDENT of each other. If features are correlated (dependent), the coefficient estimates become unstable and the model's output is unreliable. For example, if "square footage" and "number of rooms" are highly correlated, the model cannot cleanly separate their individual effects.
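One common way to spot this trap in practice is to check the Pearson correlation between feature pairs before fitting; a value near ±1 signals multicollinearity. A minimal sketch with hypothetical feature values:

```python
import numpy as np

# Hypothetical features: square footage and room count move together.
sqft = np.array([1000, 1500, 2000, 2500, 3000], dtype=float)
rooms = np.array([3, 4, 5, 6, 7], dtype=float)

# Pearson correlation near ±1 signals multicollinearity; consider
# dropping one of the two features before fitting the model.
corr = np.corrcoef(sqft, rooms)[0, 1]
print(round(corr, 2))  # → 1.0
```

Here the features are perfectly correlated (correlation 1.0), so keeping both adds no information and destabilizes the fit.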
⚠️ Exam Tip: If the question asks about predicting a NUMBER (price, count, amount, percentage), the answer is regression. If predicting a CATEGORY (yes/no, type, class), it's classification.