
Maximum Likelihood and Least Squares

Linear regression is the simplest yet most fundamental tool in machine learning for predicting a continuous target variable $y$ from input features $x$.


1. Foundations: Random Variables

Before diving into estimation, we must distinguish between the types of data we encounter:

  • Discrete Random Variables: $X \in \{x_1, x_2, \dots, x_n\}$. We use a Probability Mass Function (PMF), where $P(X = x_i)$ is the probability of a specific outcome.
    • Example: A coin toss or a dice roll.
  • Continuous Random Variables: $X \in \mathbb{R}$. We use a Probability Density Function (PDF). Note that for continuous variables, $P(X = x) = 0$ for any specific point; we instead measure the probability over an interval.
    • Gaussian (Normal) Distribution: The most common PDF in ML: $P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x - \mu)^2}$
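The Gaussian PDF is easy to evaluate directly. A minimal NumPy sketch (the function name is mine); as a sanity check, the density should integrate to 1 over the real line:

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    # N(mu, sigma^2) density: (1 / sqrt(2*pi*sigma^2)) * exp(-(x - mu)^2 / (2*sigma^2))
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Numerical check: the density should integrate to ~1 over a wide interval.
xs = np.linspace(-10.0, 10.0, 200_001)
dx = xs[1] - xs[0]
area = np.sum(gaussian_pdf(xs)) * dx
print(round(area, 6))  # -> 1.0
```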

2. The Dataset and Likelihood

We represent our dataset $D$ as a collection of $m$ samples: $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$

For independent events, the joint probability is the product of the individual probabilities: $P(A \cap B) = P(A)P(B) = P(A, B)$

Accordingly, the Likelihood of our data given parameters $\theta$ is: $L(\theta; D) = P(D; \theta)$

Generative vs. Discriminative View

  • Generative Algorithms: Model the joint probability $P(x, y)$.
  • Discriminative Algorithms: Model the conditional probability $P(y \mid x)$.

3. The Generative Model for Regression

In Linear Regression, we assume that the target variable $y$ is generated by: $y = \theta^T x + \epsilon$

where the noise $\epsilon$ follows a Gaussian distribution: $\epsilon \sim \mathcal{N}(0, \sigma^2)$. This implies that $y$ itself is a random variable following a Gaussian distribution centered at our prediction: $y \sim \mathcal{N}(\theta^T x, \sigma^2)$
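This generative story can be simulated directly. A small sketch, where the intercept, slope, and noise level are illustrative values of my choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true = np.array([5.0, 1.5])  # illustrative intercept and slope
sigma = 2.0                        # noise standard deviation
m = 100

# Design matrix: a column of ones (bias term) plus one feature column.
X = np.column_stack([np.ones(m), rng.uniform(0.0, 10.0, m)])
eps = rng.normal(0.0, sigma, size=m)  # epsilon ~ N(0, sigma^2)
y = X @ theta_true + eps              # y = theta^T x + epsilon, so y ~ N(theta^T x, sigma^2)
```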

For a dataset of $m$ Independent and Identically Distributed (i.i.d.) observations: $L(\theta; D) = \prod_{i=1}^m P(y^{(i)} \mid x^{(i)}; \theta)$


4. Maximum Likelihood Estimation (MLE)

To find the optimal $\theta$, we maximize the Log-Likelihood (since $\log$ is a monotonic function, maximizing $\log L$ is the same as maximizing $L$):

$$\log L(\theta; D) = \log \prod_{i=1}^m P(y^{(i)} \mid x^{(i)}; \theta) = \sum_{i=1}^m \log P(y^{(i)} \mid x^{(i)}; \theta)$$
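The sum form is not just algebraically convenient: a product of many densities quickly underflows 64-bit floats, while the sum of logs stays well-scaled. A small sketch with made-up per-sample density values:

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for the per-sample densities P(y_i | x_i; theta), all well below 1.
densities = rng.uniform(0.01, 0.5, size=1000)

product = np.prod(densities)         # underflows to exactly 0.0 in float64
log_sum = np.sum(np.log(densities))  # finite and well-scaled

print(product)               # -> 0.0
print(np.isfinite(log_sum))  # -> True
```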

For the Gaussian case: $P(y \mid x; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(y - \theta^T x)^2}$

Taking the log: $$\log L(\theta) = \sum_{i=1}^m \left[ \underbrace{\log \frac{1}{\sqrt{2\pi\sigma^2}}}_{\text{Constant}} - \frac{1}{2\sigma^2} \left(y^{(i)} - \theta^T x^{(i)}\right)^2 \right]$$
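This decomposition translates directly into code. A minimal sketch (the function name and the synthetic data are mine); the true parameters should score a higher log-likelihood than a clearly wrong guess:

```python
import numpy as np

def gaussian_log_likelihood(theta, X, y, sigma):
    # log L(theta) = m * log(1 / sqrt(2*pi*sigma^2)) - sum(residuals^2) / (2*sigma^2)
    m = len(y)
    residuals = y - X @ theta
    constant = m * np.log(1.0 / np.sqrt(2.0 * np.pi * sigma ** 2))
    return constant - np.sum(residuals ** 2) / (2.0 * sigma ** 2)

# Synthetic data for a sanity check (illustrative parameter values).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.uniform(0.0, 10.0, 50)])
theta_true = np.array([5.0, 1.5])
y = X @ theta_true + rng.normal(0.0, 2.0, 50)

print(gaussian_log_likelihood(theta_true, X, y, 2.0) >
      gaussian_log_likelihood(np.array([0.0, -1.0]), X, y, 2.0))  # -> True
```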

Connection to Least Squares

Since we want to maximize this expression and the first term is constant in $\theta$, only the second term matters. Maximizing the negated sum of squares is equivalent to minimizing the sum itself: $\arg\max_\theta \log L(\theta) = \arg\min_\theta \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2$

This is the Ordinary Least Squares (OLS) objective. In matrix form, with design matrix $X$ (one row per sample) and target vector $y$, the closed-form solution is: $\theta_{ML} = (X^T X)^{-1} X^T y$
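The closed-form solution can be checked numerically. A sketch on synthetic data (parameter values are illustrative); solving the normal equations with np.linalg.solve is preferred over forming an explicit inverse for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 200
X = np.column_stack([np.ones(m), rng.uniform(0.0, 10.0, m)])
theta_true = np.array([5.0, 1.5])
y = X @ theta_true + rng.normal(0.0, 2.0, m)

# ML / OLS estimate: solve (X^T X) theta = X^T y, i.e. theta = (X^T X)^{-1} X^T y.
theta_ml = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(theta_ml, theta_lstsq))  # -> True
```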

Interactive Least Squares Fit

[Interactive demo: Ordinary Least Squares (OLS) minimizes the sum of squared residuals, drawn as red dashed lines between each of the 10 sample points and the fitted line. Controls set the true slope (1.5) and the Gaussian noise level (2.0); with these settings the MLE fit recovers approximately y = 1.49x + 5.09.]

Goal: We want to find the parameters $\theta$ for which the likelihood of the observed data is the highest.