Machine Learning
Linear Regression
MLE & Least Squares

Maximum Likelihood and Least Squares

Linear regression is the simplest yet most fundamental tool in machine learning for predicting a continuous target variable yy from input features xx.


1. Foundations: Random Variables

Before diving into estimation, we must distinguish between the types of data we encounter:

  • Discrete Random Variables: X{x1,x2,,xn}X \in \{x_1, x_2, \dots, x_n\}. We use a Probability Mass Function (PMF) where P(X=xi)P(X=x_i) is the probability of a specific outcome.
    • Example: A coin toss or a dice roll.
  • Continuous Random Variables: XRX \in \mathbb{R}. We use a Probability Density Function (PDF). Note that for continuous variables, P(X=x)=0P(X=x) = 0 for any specific point; we instead measure the probability over an interval.
    • Gaussian (Normal) Distribution: The most common PDF in ML: P(x)=12πσ2e12σ2(xμ)2P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2\sigma^2}(x-\mu)^2}

2. Dataset Representation

In linear regression, we represent our dataset DD as a collection of mm training examples. Each example ii consists of an input feature vector x(i)\mathbf{x}^{(i)} and a corresponding target value y(i)y^{(i)}.

D={(x(i),y(i))}i=1mD = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m

Matrix Notation

For a model with nn features, we define the Design Matrix XX and the Target Vector y\mathbf{y} as:

X=[(x(1))T(x(2))T(x(m))T]Rm×(n+1),y=[y(1)y(2)y(m)]Rm×1X = \begin{bmatrix} \rule[.5ex]{2em}{0.4pt} & (\mathbf{x}^{(1)})^T & \rule[.5ex]{2em}{0.4pt} \\ \rule[.5ex]{2em}{0.4pt} & (\mathbf{x}^{(2)})^T & \rule[.5ex]{2em}{0.4pt} \\ & \vdots & \\ \rule[.5ex]{2em}{0.4pt} & (\mathbf{x}^{(m)})^T & \rule[.5ex]{2em}{0.4pt} \end{bmatrix} \in \mathbb{R}^{m \times (n+1)}, \quad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m \times 1}

To handle the bias term (intercept θ0\theta_0), we prepend a dummy feature x0=1x_0 = 1 to every input vector, so that x(i)Rn+1\mathbf{x}^{(i)} \in \mathbb{R}^{n+1}.


3. The Hypothesis and Cost Function

Our Hypothesis is a linear combination of features: hθ(x)=θ0x0+θ1x1++θnxn=j=0nθjxj=θTxh_{\theta}(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \sum_{j=0}^n \theta_j x_j = \boldsymbol{\theta}^T \mathbf{x}

The Cost Function J(θ)J(\boldsymbol{\theta})

We want to measure the "average error" of our predictions across the entire dataset. We use the Sum of Squared Errors (SSE):

J(θ)=12mi=1m(hθ(x(i))y(i))2J(\boldsymbol{\theta}) = \frac{1}{2m} \sum_{i=1}^m \left( h_{\boldsymbol{\theta}}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2

📐

Why Squared Error? Squaring the error is an order-preserving transformation. While we could use absolute error (hy|h-y|), squaring makes the function smooth (differentiable) and convex, which guarantees a single global minimum that can be easily found using calculus.


4. Optimization I: Gradient Descent

Gradient Descent is an iterative algorithm that starts with random parameters and moves them in the direction that most steeply decreases the cost J(θ)J(\boldsymbol{\theta}).

The Update Rule

For every parameter θj\theta_j (where j=0,,nj = 0, \dots, n): θj:=θjαθjJ(θ)\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\boldsymbol{\theta})

Where α\alpha is the Learning Rate, controlling the size of our "steps" down the gradient.

The Gradient Derivation

Computing the partial derivative for a single example (x,y)(x, y): Jθj=1mi=1m(hθ(x(i))y(i))xj(i)\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}

This results in the Batch Gradient Descent update: θj:=θjα1mi=1m(hθ(x(i))y(i))xj(i)\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)}) x_j^{(i)}


5. Mathematical Proof of Convexity

Why does Gradient Descent work so effectively for Linear Regression? It’s because the cost function J(θ)J(\boldsymbol{\theta}) is Convex, meaning any local minimum is also a global minimum.

The Hessian Matrix

To prove convexity, we examine the Hessian Matrix H\mathbf{H}, which is the matrix of second-order partial derivatives: H=2J(θ)\mathbf{H} = \nabla^2 J(\boldsymbol{\theta})

For the squared error cost function, the Hessian can be derived as: H=1mXTX\mathbf{H} = \frac{1}{m} X^T X

Positive Semi-Definiteness (PSD)

A function is convex if its Hessian is Positive Semi-Definite. For any non-zero vector z\mathbf{z}: zTHz=zT(1mXTX)z=1m(Xz)T(Xz)=1mXz20\mathbf{z}^T \mathbf{H} \mathbf{z} = \mathbf{z}^T \left( \frac{1}{m} X^T X \right) \mathbf{z} = \frac{1}{m} (X\mathbf{z})^T (X\mathbf{z}) = \frac{1}{m} \|X\mathbf{z}\|^2 \ge 0

Since Xz2\|X\mathbf{z}\|^2 is always non-negative, the Hessian is PSD, proving that the cost function is a "bowl-shaped" convex surface.


6. Optimization Variants

While Batch Gradient Descent is standard, several variants offer different computational trade-offs:

VariantLogicComputational ComplexityConvergence
Full BatchUses all mm examples for every update.O(mn)O(m \cdot n)Smooth & Stable
Mini-BatchUses a small subset (e.g., 32 or 64 samples).O(batch_sizen)O(\text{batch\_size} \cdot n)Efficient balance
Stochastic (SGD)Uses exactly one sample per update.O(n)O(n)Fast but "noisy"

The Learning Framework: Optimization is a heuristic process following the cycle: Assumption (Model type) \to Evaluation (Cost Calculation) \to Refinement (Weight Updates). If the evaluation metric is satisfied, the process stops.


7. Optimization II: Normal Equation (Closed Form)

Alternatively, we can solve for the optimal θ\boldsymbol{\theta} analytically. By setting the gradient of the cost function to zero, we derive the Normal Equation:

θJ(θ)=0    θ=(XTX)1XTy\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = 0 \implies \boldsymbol{\theta} = (X^T X)^{-1} X^T \mathbf{y}

Derivation Steps:

  1. Represent J(θ)J(\theta) in matrix form: J(θ)=12m(Xθy)T(Xθy)J(\theta) = \frac{1}{2m} (X\theta - y)^T (X\theta - y).
  2. Compute the derivative with respect to the vector θ\theta.
  3. Set the result to zero and solve for θ\theta.
MethodGradient DescentNormal Equation
ComplexityO(kn2)O(kn^2) (iterative)O(n3)O(n^3) (matrix inversion)
ScalingWorks well with large nnVery slow if nn is very large
HyperparametersRequires choosing α\alphaNo hyperparameters

Interactive Least Squares Fit

Ordinary Least Squares (OLS) minimizes the squared sum of residuals (red dashed lines). Adjust the noise to see how it affects the fit's confidence.

True Slope
1.5
The underlying pattern
Gaussian Noise Level
2.0
Randomness in the data
MLE Fit Result: y = 1.49x + 5.09
Residuals: Minimizing the total sum of the 10 red lines.
⚠️

Goal: We want to find the parameters θ\theta for which the likelihood of the observed data is the highest.