Maximum Likelihood and Least Squares
Linear regression is the simplest yet most fundamental tool in machine learning for predicting a continuous target variable from input features .
1. Foundations: Random Variables
Before diving into estimation, we must distinguish between the types of data we encounter:
- Discrete Random Variables: . We use a Probability Mass Function (PMF) where is the probability of a specific outcome.
- Example: A coin toss or a dice roll.
- Continuous Random Variables: . We use a Probability Density Function (PDF). Note that for continuous variables, for any specific point; we instead measure the probability over an interval.
- Gaussian (Normal) Distribution: The most common PDF in ML:
2. Dataset Representation
In linear regression, we represent our dataset as a collection of training examples. Each example consists of an input feature vector and a corresponding target value .
Matrix Notation
For a model with features, we define the Design Matrix and the Target Vector as:
To handle the bias term (intercept ), we prepend a dummy feature to every input vector, so that .
3. The Hypothesis and Cost Function
Our Hypothesis is a linear combination of features:
The Cost Function
We want to measure the "average error" of our predictions across the entire dataset. We use the Sum of Squared Errors (SSE):
Why Squared Error? Squaring the error is an order-preserving transformation. While we could use absolute error (), squaring makes the function smooth (differentiable) and convex, which guarantees a single global minimum that can be easily found using calculus.
4. Optimization I: Gradient Descent
Gradient Descent is an iterative algorithm that starts with random parameters and moves them in the direction that most steeply decreases the cost .
The Update Rule
For every parameter (where ):
Where is the Learning Rate, controlling the size of our "steps" down the gradient.
The Gradient Derivation
Computing the partial derivative for a single example :
This results in the Batch Gradient Descent update:
5. Mathematical Proof of Convexity
Why does Gradient Descent work so effectively for Linear Regression? It’s because the cost function is Convex, meaning any local minimum is also a global minimum.
The Hessian Matrix
To prove convexity, we examine the Hessian Matrix , which is the matrix of second-order partial derivatives:
For the squared error cost function, the Hessian can be derived as:
Positive Semi-Definiteness (PSD)
A function is convex if its Hessian is Positive Semi-Definite. For any non-zero vector :
Since is always non-negative, the Hessian is PSD, proving that the cost function is a "bowl-shaped" convex surface.
6. Optimization Variants
While Batch Gradient Descent is standard, several variants offer different computational trade-offs:
| Variant | Logic | Computational Complexity | Convergence |
|---|---|---|---|
| Full Batch | Uses all examples for every update. | Smooth & Stable | |
| Mini-Batch | Uses a small subset (e.g., 32 or 64 samples). | Efficient balance | |
| Stochastic (SGD) | Uses exactly one sample per update. | Fast but "noisy" |
The Learning Framework: Optimization is a heuristic process following the cycle: Assumption (Model type) Evaluation (Cost Calculation) Refinement (Weight Updates). If the evaluation metric is satisfied, the process stops.
7. Optimization II: Normal Equation (Closed Form)
Alternatively, we can solve for the optimal analytically. By setting the gradient of the cost function to zero, we derive the Normal Equation:
Derivation Steps:
- Represent in matrix form: .
- Compute the derivative with respect to the vector .
- Set the result to zero and solve for .
| Method | Gradient Descent | Normal Equation |
|---|---|---|
| Complexity | (iterative) | (matrix inversion) |
| Scaling | Works well with large | Very slow if is very large |
| Hyperparameters | Requires choosing | No hyperparameters |
Interactive Least Squares Fit
Ordinary Least Squares (OLS) minimizes the squared sum of residuals (red dashed lines). Adjust the noise to see how it affects the fit's confidence.
Goal: We want to find the parameters for which the likelihood of the observed data is the highest.