Maximum Likelihood and Least Squares

Linear regression is one of the simplest and most fundamental tools in machine learning for predicting a continuous target variable $y$ from input features $x$.

1. Foundations: Random Variables

Before diving into estimation, we must distinguish between the two types of random variables we encounter:

Discrete Random Variables: $X \in \{x_1, x_2, \dots, x_n\}$. We use a Probability Mass Function (PMF), where $P(X=x_i)$ is the probability of a specific outcome.
- Example: A coin toss or a dice roll.
Continuous Random Variables: $X \in \mathbb{R}$. We use a Probability Density Function (PDF). Note that for continuous variables, $P(X=x) = 0$ for any specific point; we instead measure the probability over an interval.
- Gaussian (Normal) Distribution: The most common PDF in ML: $P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}$
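
To make the "probability over an interval" idea concrete, here is a minimal sketch that evaluates the Gaussian density above and approximates the probability of landing within one standard deviation of the mean. The function name `gaussian_pdf` and the grid resolution are illustrative choices, not part of the original notes.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of a Gaussian with mean mu and variance sigma^2 at x."""
    coeff = 1.0 / np.sqrt(2.0 * np.pi * sigma**2)
    return coeff * np.exp(-((x - mu) ** 2) / (2.0 * sigma**2))

# The density at a point is not a probability; the probability of an
# interval is the area under the curve, approximated here with the
# trapezoid rule on a fine grid over [-1, 1].
xs = np.linspace(-1.0, 1.0, 20001)
ys = gaussian_pdf(xs)
dx = xs[1] - xs[0]
prob_within_one_sigma = float(np.sum((ys[:-1] + ys[1:]) / 2.0) * dx)

print(round(prob_within_one_sigma, 3))  # about 0.683 for a standard normal
```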

2. Dataset Representation

In linear regression, we represent our dataset $D$ as a collection of $m$ training examples. Each example $i$ consists of an input feature vector $\mathbf{x}^{(i)}$ and a corresponding target value $y^{(i)}$ .

$D = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m$

Matrix Notation

For a model with $n$ features, we define the Design Matrix $X$ and the Target Vector $\mathbf{y}$ as:

$X = \begin{bmatrix} \rule[.5ex]{2em}{0.4pt} & (\mathbf{x}^{(1)})^T & \rule[.5ex]{2em}{0.4pt} \\ \rule[.5ex]{2em}{0.4pt} & (\mathbf{x}^{(2)})^T & \rule[.5ex]{2em}{0.4pt} \\ & \vdots & \\ \rule[.5ex]{2em}{0.4pt} & (\mathbf{x}^{(m)})^T & \rule[.5ex]{2em}{0.4pt} \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \quad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times 1}$

To handle the bias term (intercept $\theta_0$ ), we prepend a dummy feature $x_0 = 1$ to every input vector, so that $\mathbf{x}^{(i)} \in \mathbb{R}^{n+1}$ .
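
As a rough illustration of this construction, the sketch below builds a design matrix from a small made-up dataset by prepending a column of ones. The numbers and the name `X_raw` are hypothetical and exist only for this example.

```python
import numpy as np

# Hypothetical raw inputs: m = 4 examples, n = 2 features each.
X_raw = np.array([[2.0, 3.0],
                  [1.0, 0.5],
                  [4.0, 1.0],
                  [3.0, 2.5]])
y = np.array([10.0, 4.0, 11.0, 12.0])  # target vector, one entry per example

m, n = X_raw.shape

# Prepend the dummy feature x_0 = 1 so theta_0 acts as the intercept.
X = np.hstack([np.ones((m, 1)), X_raw])

print(X.shape)  # (m, n + 1)
```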

3. The Hypothesis and Cost Function

Our Hypothesis is a linear combination of features: $h_{\theta}(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{j=0}^n \theta_j x_j = \boldsymbol{\theta}^T \mathbf{x}$

The Cost Function $J(\boldsymbol{\theta})$

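We want to measure the average error of our predictions across the entire dataset. We use the Sum of Squared Errors (SSE), scaled by $\frac{1}{2m}$ so that the cost is an average and the factor of 2 cancels when we differentiate: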

$J(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^m \left( h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$
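
A minimal vectorized sketch of the hypothesis and this cost, assuming the design matrix already contains the column of ones. The helper names `predict` and `cost` and the toy data are assumptions made for illustration.

```python
import numpy as np

def predict(X, theta):
    """Hypothesis h_theta(x) = theta^T x, applied to every row of X."""
    return X @ theta

def cost(X, y, theta):
    """J(theta) = 1/(2m) * sum of squared prediction errors."""
    m = X.shape[0]
    residuals = predict(X, theta) - y
    return float(residuals @ residuals) / (2 * m)

# Toy data generated exactly by y = 1 + 2x (first column is x_0 = 1).
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

print(cost(X, y, np.array([1.0, 2.0])))  # 0.0, the true parameters fit exactly
print(cost(X, y, np.zeros(2)))           # (1 + 9 + 25) / 6
```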

Why Squared Error? Squaring grows with the magnitude of the error, so larger mistakes are always penalized more heavily. Absolute error ( $|h-y|$ ) is also convex, but it is not differentiable at zero. Squared error is smooth (differentiable everywhere) and convex, so any local minimum is a global minimum and can be found using calculus.

4. Optimization I: Gradient Descent

Gradient Descent is an iterative algorithm that starts with random parameters and moves them in the direction that most steeply decreases the cost $J(\boldsymbol{\theta})$ .

The Update Rule

For every parameter $\theta_j$ (where $j = 0, \dots, n$ ): $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\boldsymbol{\theta})$

Here $\alpha$ is the Learning Rate, which controls the size of each step we take in the direction of the negative gradient.

The Gradient Derivation

Differentiating $J(\boldsymbol{\theta})$ with respect to $\theta_j$ over all $m$ examples (the factor of 2 from the square cancels the $\frac{1}{2}$): $\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$

This results in the Batch Gradient Descent update: $\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}$
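
The update rule can be sketched in vectorized form as follows. This is an illustrative implementation, not a reference one: the function name, the zero initialization (any starting point works for a convex cost), the learning rate, and the fixed iteration count are all assumptions chosen for the toy data.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.5, num_iters=2000):
    """Repeat theta := theta - alpha * (1/m) * X^T (X theta - y)."""
    m, d = X.shape
    theta = np.zeros(d)  # assumed starting point
    for _ in range(num_iters):
        # Stacks the partial derivatives for j = 0..n into one vector.
        gradient = X.T @ (X @ theta - y) / m
        theta = theta - alpha * gradient  # all theta_j updated simultaneously
    return theta

# Toy data generated exactly by y = 1 + 2x.
x = np.linspace(0.0, 1.0, 20)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta = batch_gradient_descent(X, y)
print(theta)  # close to [1, 2]
```

If `alpha` is set too large for the data, the iterates diverge instead of converging, so in practice the learning rate has to be tuned.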

5. Mathematical Proof of Convexity

Why does Gradient Descent work so effectively for Linear Regression? It’s because the cost function $J(\boldsymbol{\theta})$ is Convex, meaning any local minimum is also a global minimum.

The Hessian Matrix

To prove convexity, we examine the Hessian Matrix $\mathbf{H}$ , which is the matrix of second-order partial derivatives: $\mathbf{H} = \nabla^2 J(\boldsymbol{\theta})$

For the squared error cost function, the Hessian can be derived as: $\mathbf{H} = \frac{1}{m} X^T X$

Positive Semi-Definiteness (PSD)

A twice-differentiable function is convex if its Hessian is Positive Semi-Definite everywhere. For any vector $\mathbf{z}$ : $\mathbf{z}^T \mathbf{H} \mathbf{z} = \mathbf{z}^T \left( \frac{1}{m} X^T X \right) \mathbf{z} = \frac{1}{m} (X\mathbf{z})^T (X\mathbf{z}) = \frac{1}{m} \|X\mathbf{z}\|^2 \ge 0$

Since $\|X\mathbf{z}\|^2$ is always non-negative, the Hessian is PSD, proving that the cost function is a "bowl-shaped" convex surface.
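
The argument can be checked numerically on a random design matrix. The sketch below is a sanity check under assumed data (random Gaussian features, 50 examples), not a proof: it confirms that the eigenvalues of $\frac{1}{m} X^T X$ are non-negative and that the quadratic form equals $\frac{1}{m}\|X\mathbf{z}\|^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed random design matrix: 50 examples, bias column plus 3 features.
X = np.column_stack([np.ones(50), rng.normal(size=(50, 3))])
m = X.shape[0]

H = X.T @ X / m  # Hessian of the squared-error cost

# H is symmetric, so eigvalsh applies; PSD means no negative eigenvalues
# (up to floating-point round-off).
eigenvalues = np.linalg.eigvalsh(H)

# Compare z^T H z against (1/m) * ||X z||^2 for a random z.
z = rng.normal(size=H.shape[0])
quad_form = float(z @ H @ z)
norm_form = float(np.sum((X @ z) ** 2) / m)

print(eigenvalues.min() >= -1e-12, np.isclose(quad_form, norm_form))
```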

6. Optimization Variants

While Batch Gradient Descent is standard, several variants offer different computational trade-offs:

| Variant | Examples per update | Cost per update | Convergence |
|---|---|---|---|
| Full Batch | All $m$ examples | $O(m \cdot n)$ | Smooth and stable |
| Mini-Batch | A small subset (e.g., 32 or 64 samples) | $O(\text{batch\_size} \cdot n)$ | Balances speed and stability |
| Stochastic (SGD) | Exactly one sample | $O(n)$ | Fast but "noisy" |
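
The three variants differ only in how many examples feed each update, so one sketch can cover them all. In the hypothetical function below, `batch_size = m` gives full batch, `batch_size = 1` gives SGD, and anything in between is mini-batch. The shuffling scheme, defaults, and noise-free toy data are assumptions for illustration.

```python
import numpy as np

def minibatch_gradient_descent(X, y, alpha=0.1, batch_size=32,
                               num_epochs=200, seed=0):
    """Gradient descent where each update uses only batch_size examples."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(num_epochs):
        order = rng.permutation(m)  # reshuffle the examples every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Same gradient formula as batch GD, averaged over the subset.
            gradient = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta = theta - alpha * gradient
    return theta

# Noise-free toy data from y = 1 + 2x, so every variant can reach the optimum.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

theta_mini = minibatch_gradient_descent(X, y, alpha=0.5, batch_size=32)
theta_sgd = minibatch_gradient_descent(X, y, alpha=0.5, batch_size=1)
```

On noisy data the single-sample updates would keep fluctuating around the minimum, which is the "noisy" behavior the table refers to.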

The Learning Framework: Optimization is an iterative process that follows the cycle: Assumption (model type) $\to$ Evaluation (cost calculation) $\to$ Refinement (weight updates). The cycle repeats until the evaluation metric meets the chosen stopping criterion.

7. Optimization II: Normal Equation (Closed Form)

Alternatively, we can solve for the optimal $\boldsymbol{\theta}$ analytically. By setting the gradient of the cost function to zero, we derive the Normal Equation:

$\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = 0 \implies \boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y}$

Derivation Steps:

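Represent $J(\boldsymbol{\theta})$ in matrix form: $J(\boldsymbol{\theta}) = \frac{1}{2m} (X\boldsymbol{\theta} - \mathbf{y})^T (X\boldsymbol{\theta} - \mathbf{y})$ .
Compute the gradient with respect to the vector $\boldsymbol{\theta}$ : $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \frac{1}{m} X^T (X\boldsymbol{\theta} - \mathbf{y})$ .
Set the result to zero and solve for $\boldsymbol{\theta}$ : $X^T X \boldsymbol{\theta} = X^T \mathbf{y}$ , which gives the Normal Equation whenever $X^T X$ is invertible.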
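
In code, the closed-form solution might look like the sketch below. It solves the linear system $X^T X \boldsymbol{\theta} = X^T \mathbf{y}$ with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice but an assumption on top of the text. The function name and toy data are illustrative.

```python
import numpy as np

def normal_equation(X, y):
    """Solve (X^T X) theta = X^T y; requires X^T X to be invertible."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data generated exactly by y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = np.array([1.0, 3.0, 5.0, 7.0])

theta = normal_equation(X, y)
print(theta)  # close to [1, 2]

# Cross-check against NumPy's least-squares solver, which also copes
# with a singular or ill-conditioned X^T X.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```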

| Method | Gradient Descent | Normal Equation |
|---|---|---|
| Complexity | $O(kmn)$ for $k$ iterations over $m$ examples | $O(n^3)$ (matrix inversion) |
| Scaling | Works well with large $n$ | Very slow if $n$ is very large |
| Hyperparameters | Requires choosing $\alpha$ | No hyperparameters |

[Interactive figure: Least Squares Fit. Ordinary Least Squares (OLS) minimizes the sum of squared residuals, drawn as red dashed lines between each of the 10 data points and the fitted line. Controls: True Slope (shown at 1.5) and Gaussian Noise Level (shown at 2.0). Fit result shown: y = 1.49x + 5.09.]
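
A rough offline version of this demo can be sketched as follows. The slope (1.5), noise level (2.0), and point count (10) come from the demo; the true intercept of 5.0, the $x$ range, and the random seed are assumptions, so the fitted numbers will not match the figure exactly.

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed
true_slope, noise_level = 1.5, 2.0
true_intercept = 5.0             # assumed; the demo only reports the fitted value

# Ten points on an assumed range, with Gaussian noise added to the line.
x = np.linspace(0.0, 10.0, 10)
y = true_slope * x + true_intercept + rng.normal(scale=noise_level, size=x.size)

# Ordinary least squares fit of y = intercept + slope * x.
X = np.column_stack([np.ones_like(x), x])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = theta

# The fitted line never has a larger squared-residual sum than the true line.
residuals = y - X @ theta
sse_fit = float(residuals @ residuals)
sse_true = float(np.sum((y - (true_slope * x + true_intercept)) ** 2))

print(f"fit: y = {slope:.2f}x + {intercept:.2f}, SSE = {sse_fit:.2f}")
```

Raising `noise_level` spreads the points further from the line, so the fitted slope and intercept vary more from one random draw to the next.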

Goal: We want to find the parameters $\theta$ for which the likelihood of the observed data is the highest.
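
One way to make this goal concrete relies on an assumption the text has not stated explicitly: suppose each target is generated as $y^{(i)} = \boldsymbol{\theta}^T \mathbf{x}^{(i)} + \epsilon^{(i)}$ with independent Gaussian noise $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. The symbols $\epsilon^{(i)}$, $L$, and $\ell$ are introduced here for illustration. Under that assumption, the likelihood of the observed data is:

$L(\boldsymbol{\theta}) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \boldsymbol{\theta}^T \mathbf{x}^{(i)} \right)^2 \right)$

Taking the logarithm turns the product into a sum:

$\ell(\boldsymbol{\theta}) = \log L(\boldsymbol{\theta}) = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^m \left( y^{(i)} - \boldsymbol{\theta}^T \mathbf{x}^{(i)} \right)^2 = -\frac{m}{2} \log(2\pi\sigma^2) - \frac{m}{\sigma^2} J(\boldsymbol{\theta})$

The first term and the factor $\frac{m}{\sigma^2}$ do not depend on $\boldsymbol{\theta}$, so maximizing the log-likelihood is the same as minimizing the squared-error cost $J(\boldsymbol{\theta})$. Under the Gaussian noise assumption, the maximum likelihood estimate and the least squares estimate coincide.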
