Maximum Likelihood and Least Squares
Linear regression is the simplest yet most fundamental tool in machine learning for predicting a continuous target variable from input features.
1. Foundations: Random Variables
Before diving into estimation, we must distinguish between the types of data we encounter:
- Discrete Random Variables: take values from a countable set. We use a Probability Mass Function (PMF), where $P(X = x)$ is the probability of a specific outcome.
- Example: A coin toss or a dice roll.
- Continuous Random Variables: take values from a continuous range. We use a Probability Density Function (PDF) $p(x)$. Note that for continuous variables, $P(X = x) = 0$ for any specific point; we instead measure the probability over an interval: $P(a \le X \le b) = \int_a^b p(x)\,dx$.
- Gaussian (Normal) Distribution: The most common PDF in ML:
$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
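The density above is straightforward to evaluate directly. A minimal sketch in NumPy (the function name `gaussian_pdf` is just an illustrative choice):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Evaluate the Gaussian density N(x | mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# The density peaks at the mean: N(0 | 0, 1) = 1/sqrt(2*pi) ~ 0.3989
print(gaussian_pdf(0.0, 0.0, 1.0))
```

Note that the returned value is a density, not a probability: it can exceed 1 when $\sigma^2$ is small.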
2. The Dataset and Likelihood
We represent our dataset as a collection of samples:
$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$$
For independent events, the joint probability is the product of individual probabilities:
$$p(x_1, x_2, \dots, x_N) = \prod_{i=1}^{N} p(x_i)$$
Accordingly, the Likelihood of our data $\mathcal{D}$ given parameters $\theta$ is:
$$L(\theta) = p(\mathcal{D} \mid \theta) = \prod_{i=1}^{N} p(y_i \mid \mathbf{x}_i, \theta)$$
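To make the product-of-probabilities idea concrete, here is a small sketch using a hypothetical Bernoulli (coin-toss) dataset; the `likelihood` helper and the data are illustrative, not from the text:

```python
import numpy as np

# Hypothetical coin-toss data: 1 = heads, 0 = tails (7 heads out of 10)
tosses = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

def likelihood(theta, data):
    """Product of per-sample Bernoulli probabilities (independence assumption)."""
    return np.prod(np.where(data == 1, theta, 1 - theta))

# Scan a grid of candidate parameters; the maximizer is the sample mean, 0.7
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax([likelihood(t, tosses) for t in thetas])]
print(best)
```

In practice the product of many small probabilities underflows, which is one reason the log-likelihood (introduced below) is preferred numerically as well as analytically.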
Generative vs. Discriminative View
- Generative Algorithms: Model the joint probability $p(\mathbf{x}, y)$.
- Discriminative Algorithms: Model the conditional probability $p(y \mid \mathbf{x})$.
3. The Generative Model for Regression
In Linear Regression, we assume that the target variable is generated by:
$$y = \mathbf{w}^\top \mathbf{x} + \epsilon$$
Where the noise follows a Gaussian distribution: $\epsilon \sim \mathcal{N}(0, \sigma^2)$. This implies that $y$ itself is a random variable following a Gaussian distribution centered at our prediction:
$$p(y \mid \mathbf{x}, \mathbf{w}) = \mathcal{N}(y \mid \mathbf{w}^\top \mathbf{x}, \sigma^2)$$
For a dataset of $N$ Independent and Identically Distributed (i.i.d.) observations:
$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2)$$
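The generative story above can be simulated directly. A minimal sketch, assuming an illustrative ground truth of $y = 2x + 1$ with noise $\sigma = 0.5$ (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground truth for illustration: y = 2x + 1 plus Gaussian noise
w_true, b_true, sigma = 2.0, 1.0, 0.5
N = 100
x = rng.uniform(-1.0, 1.0, size=N)
noise = rng.normal(0.0, sigma, size=N)   # epsilon ~ N(0, sigma^2)
y = w_true * x + b_true + noise          # each y_i ~ N(w*x_i + b, sigma^2)
```

Each $y_i$ is an independent draw from a Gaussian centered at the model's prediction, which is exactly the i.i.d. assumption the likelihood factorization relies on.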
4. Maximum Likelihood Estimation (MLE)
To find the optimal $\mathbf{w}$, we maximize the Log-Likelihood (since $\log$ is a monotonic function, maximizing $\log L(\mathbf{w})$ is the same as maximizing $L(\mathbf{w})$):
$$\mathbf{w}^* = \arg\max_{\mathbf{w}} \log L(\mathbf{w})$$
For the Gaussian case:
$$L(\mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2}\right)$$
Taking the log:
$$\log L(\mathbf{w}) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$$
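As a numerical sanity check on this algebra, the closed-form expression for $\log L(\mathbf{w})$ should equal the sum of per-point log densities. A small sketch on synthetic data (the weight and noise level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma2 = 50, 0.25
x = rng.uniform(-1.0, 1.0, N)
w = 1.5                                  # hypothetical weight
y = w * x + rng.normal(0.0, np.sqrt(sigma2), N)

residuals = y - w * x
# Closed-form expansion of the Gaussian log-likelihood
loglik = -N / 2 * np.log(2 * np.pi * sigma2) - np.sum(residuals ** 2) / (2 * sigma2)

# Sanity check: equals the sum of per-point Gaussian log densities
per_point = -0.5 * np.log(2 * np.pi * sigma2) - residuals ** 2 / (2 * sigma2)
assert np.isclose(loglik, per_point.sum())
```

Only the residual term depends on $\mathbf{w}$, which is what makes the next simplification possible.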
Connection to Least Squares
Since we want to maximize this expression and the first term is constant with respect to $\mathbf{w}$, we focus on the second term. Maximizing a negative quantity is equivalent to minimizing its positive version:
$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$$
This is the Ordinary Least Squares (OLS) objective. Stacking the inputs row-wise into a design matrix $\mathbf{X}$, the closed-form solution is:
$$\mathbf{w}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
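The closed-form solution is a few lines of NumPy. A sketch on hypothetical data with true parameters $[1, 2]$ (intercept, slope); note we solve the normal equations with `np.linalg.solve` rather than forming the explicit inverse, which is numerically preferable:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200
x = rng.uniform(-2.0, 2.0, N)
X = np.column_stack([np.ones(N), x])     # design matrix with a bias column
w_true = np.array([1.0, 2.0])            # hypothetical intercept and slope
y = X @ w_true + rng.normal(0.0, 0.3, N)

# Normal equations: (X^T X) w = X^T y  <=>  w* = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)                             # close to [1.0, 2.0]
```

The recovered `w_hat` approaches `w_true` as $N$ grows or the noise shrinks, which is the MLE consistency story in miniature.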
Interactive Least Squares Fit
Ordinary Least Squares (OLS) minimizes the sum of squared residuals (red dashed lines). Adjust the noise to see how it affects the fit's confidence.
Goal: We want to find the parameters for which the likelihood of the observed data is the highest.