Regularization (Ridge/Lasso)

Regularized Least Squares

When we have many features or limited data, the Ordinary Least Squares (OLS) estimator can result in very large parameter values, leading to overfitting. Regularization adds a penalty term to the error function to keep the parameters small.


1. Probabilistic Interpretation: MAP Estimation

While Ordinary Least Squares (OLS) is derived from Maximum Likelihood Estimation (MLE), regularized regression has a beautiful probabilistic foundation in Maximum A Posteriori (MAP) estimation.

In MAP, we treat the parameters $\boldsymbol{\theta}$ as random variables with their own underlying prior distribution $P(\boldsymbol{\theta})$.

The MAP Objective

Using Bayes' Rule, the posterior probability of the parameters given the data is:

$$P(\boldsymbol{\theta} \mid D) \propto P(D \mid \boldsymbol{\theta}) P(\boldsymbol{\theta})$$

To find the optimal parameters, we maximize the log-posterior:

$$\hat{\theta}_{MAP} = \arg\max_{\boldsymbol{\theta}} \left[ \underbrace{\log P(D \mid \boldsymbol{\theta})}_{\text{Log-Likelihood}} + \underbrace{\log P(\boldsymbol{\theta})}_{\text{Log-Prior}} \right]$$

Ridge ($L_2$) and the Gaussian Prior

If we assume each weight $\theta_j$ independently follows a Gaussian distribution $\mathcal{N}(0, \sigma^2)$, then:

$$\log P(\boldsymbol{\theta}) = -\frac{1}{2\sigma^2} \sum_j \theta_j^2 + \text{const}$$

Maximizing the log-posterior therefore penalizes the sum of squared weights, which is exactly the Ridge Regression penalty.
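To make the link to the Ridge penalty explicit, here is a short derivation sketch. It assumes Gaussian observation noise with variance $\sigma_\epsilon^2$; that symbol does not appear in the text above and is introduced here only for illustration.

```latex
\begin{align*}
\log P(D \mid \boldsymbol{\theta})
  &= -\frac{1}{2\sigma_\epsilon^2} \sum_{i=1}^m \left(y^{(i)} - \boldsymbol{\theta}^T x^{(i)}\right)^2 + \text{const} \\
\log P(\boldsymbol{\theta})
  &= -\frac{1}{2\sigma^2} \sum_{j=1}^n \theta_j^2 + \text{const} \\
\hat{\theta}_{MAP}
  &= \arg\min_{\boldsymbol{\theta}} \left[ \frac{1}{2} \sum_{i=1}^m \left(y^{(i)} - \boldsymbol{\theta}^T x^{(i)}\right)^2
     + \frac{\sigma_\epsilon^2}{\sigma^2} \cdot \frac{1}{2} \sum_{j=1}^n \theta_j^2 \right]
\end{align*}
```

Under these assumptions the ratio $\sigma_\epsilon^2 / \sigma^2$ plays the role of the regularization coefficient: a tighter prior (smaller $\sigma^2$) means stronger shrinkage.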

Lasso ($L_1$) and the Laplacian Prior

If we instead assume each weight follows a zero-mean Laplacian distribution with scale $b$, then:

$$\log P(\boldsymbol{\theta}) = -\frac{1}{b} \sum_j |\theta_j| + \text{const}$$

This corresponds to Lasso Regression, which tends to produce sparse solutions (weights set exactly to zero).

| Estimation | $\theta$ treatment | Corresponds to... |
|------------|--------------------|-------------------|
| MLE | Fixed Parameter | Ordinary Least Squares |
| MAP | Random Variable | Regularized Regression |

2. The Regularized Objective

The general form of a regularized error function is:

$$E(\theta) = \frac{1}{2} \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \lambda R(\theta)$$

Where:

  • $\lambda \ge 0$ is the regularization coefficient (controls the trade-off between fitting the data and keeping the parameters small).
  • $R(\theta)$ is the penalty term.
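As a minimal sketch of this objective, the snippet below evaluates $E(\theta)$ for a pluggable penalty. The function names (`regularized_error`, `l2_penalty`, `l1_penalty`) and the tiny dataset are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def regularized_error(theta, X, y, lam, penalty):
    """E(theta) = 1/2 * sum of squared residuals + lam * R(theta)."""
    residuals = y - X @ theta
    return 0.5 * np.sum(residuals ** 2) + lam * penalty(theta)

def l2_penalty(theta):
    # Quadratic penalty: 1/2 * ||theta||_2^2
    return 0.5 * np.sum(theta ** 2)

def l1_penalty(theta):
    # Absolute-value penalty: ||theta||_1
    return np.sum(np.abs(theta))

# Toy data (hypothetical values chosen so the arithmetic is easy to check).
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 2.0])
theta = np.array([1.0, -1.0])

# Residuals are [2, 3], so the data term is 0.5 * (4 + 9) = 6.5.
e_l2 = regularized_error(theta, X, y, lam=0.5, penalty=l2_penalty)  # 6.5 + 0.5 * 1.0 = 7.0
e_l1 = regularized_error(theta, X, y, lam=0.5, penalty=l1_penalty)  # 6.5 + 0.5 * 2.0 = 7.5
print(e_l2, e_l1)
```

The data term is the same in both calls; only the penalty changes, which is what distinguishes the regularizers from one another.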

3. Ridge Regression ($L_2$ Regularization)

Ridge regression uses a quadratic penalty:

$$R(\theta) = \frac{1}{2} \|\theta\|^2_2 = \frac{1}{2} \sum_{j=1}^n \theta_j^2$$

Key Features:

  • Shrinks coefficients towards zero but, in general, never exactly to zero.
  • Has a closed-form solution: $\theta_{Ridge} = (X^T X + \lambda I)^{-1} X^T y$.
  • Effectively handles multicollinearity (highly correlated features): for $\lambda > 0$ the matrix $X^T X + \lambda I$ is invertible even when $X^T X$ is singular.
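The closed-form solution can be sketched in a few lines of NumPy. This is an illustrative version under simplifying assumptions: there is no intercept term, every coefficient is penalized, and the synthetic data (including the nearly collinear third column) is made up for the example. The function name `ridge_closed_form` is hypothetical.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y without forming an explicit inverse."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data with two nearly collinear features (assumed for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 1] + 1e-3 * rng.normal(size=50)
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.normal(size=50)

lam = 1.0
theta_ridge = ridge_closed_form(X, y, lam)

# At the minimizer, the gradient of 1/2 * ||y - X theta||^2 + lam/2 * ||theta||^2 vanishes.
gradient = -X.T @ (y - X @ theta_ridge) + lam * theta_ridge
print(np.max(np.abs(gradient)))
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common numerical choice; both express the same formula.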

4. Lasso Regression ($L_1$ Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) uses an absolute value penalty:

$$R(\theta) = \|\theta\|_1 = \sum_{j=1}^n |\theta_j|$$

Key Features:

  • Performs Feature Selection: It can force some coefficients to be exactly zero.
  • Produces "sparse" models.
  • Does not have a closed-form solution in general (requires iterative numerical optimization, such as coordinate descent).
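One common numerical approach is coordinate descent with soft-thresholding. The sketch below is a minimal, unoptimized version for the objective $\frac{1}{2}\|y - X\theta\|_2^2 + \lambda\|\theta\|_1$; the function names, the fixed iteration count, and the toy data are assumptions made for illustration, and the notes themselves do not prescribe a particular solver.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; values with |rho| <= lam become exactly 0."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 1/2 * ||y - X theta||^2 + lam * ||theta||_1 one coordinate at a time."""
    n_features = X.shape[1]
    theta = np.zeros(n_features)
    col_sq_norms = np.sum(X ** 2, axis=0)
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution added back in.
            partial_residual = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ partial_residual
            theta[j] = soft_threshold(rho, lam) / col_sq_norms[j]
    return theta

# With an identity design matrix the solution is just soft-thresholding of y,
# which makes the sparsity easy to see.
X = np.eye(3)
y = np.array([3.0, 0.5, -2.0])
theta_lasso = lasso_coordinate_descent(X, y, lam=1.0)
print(theta_lasso)  # the middle coefficient is exactly zero
```

The soft-threshold step is where exact zeros come from: any coordinate whose correlation with the partial residual is at most $\lambda$ in magnitude is set to zero rather than merely made small.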

5. Elastic Net

Elastic Net combines both $L_1$ and $L_2$ penalties:

$$R(\theta) = \alpha \|\theta\|_1 + (1-\alpha) \frac{1}{2} \|\theta\|^2_2$$

Here $\alpha \in [0, 1]$ sets the mix between the two penalties: $\alpha = 1$ recovers Lasso and $\alpha = 0$ recovers Ridge. It is useful when several features are correlated with each other.
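The combined penalty is straightforward to compute. The helper below is an illustrative sketch (the name `elastic_net_penalty` and the example vector are assumptions); it follows the formula above, with `alpha` as the mixing weight.

```python
import numpy as np

def elastic_net_penalty(theta, alpha):
    """R(theta) = alpha * ||theta||_1 + (1 - alpha) * 1/2 * ||theta||_2^2."""
    l1 = np.sum(np.abs(theta))
    l2 = 0.5 * np.sum(theta ** 2)
    return alpha * l1 + (1.0 - alpha) * l2

theta = np.array([3.0, -4.0])  # ||theta||_1 = 7, 1/2 * ||theta||_2^2 = 12.5

pure_l1 = elastic_net_penalty(theta, alpha=1.0)   # 7.0
pure_l2 = elastic_net_penalty(theta, alpha=0.0)   # 12.5
mixed = elastic_net_penalty(theta, alpha=0.5)     # 0.5 * 7 + 0.5 * 12.5 = 9.75
print(pure_l1, pure_l2, mixed)
```

Libraries may parameterize the mix differently (for example, with a separate overall strength and an $L_1$ ratio), so it is worth checking how a given implementation maps onto $\lambda$ and $\alpha$ as used here.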


Visualizing Shrinkage

As $\lambda$ increases, the magnitude of the weights decreases.

[Figure: Coefficient Shrinkage (Ridge/Lasso)]

As the regularization penalty $\lambda$ increases, the overall magnitude of the model's coefficients $\theta$ shrinks towards zero. This discourages the model from relying too heavily on any single feature, thus reducing overfitting.
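The shrinkage effect can be checked numerically. The sketch below uses the Ridge closed-form solution on synthetic data (the data, the grid of $\lambda$ values, and the function name `ridge_solution` are all assumptions for illustration) and prints the Euclidean norm of the coefficient vector for each $\lambda$.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Ridge estimate: solve (X^T X + lam * I) theta = X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Synthetic regression problem (hypothetical true weights and noise level).
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + 0.5 * rng.normal(size=40)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
norms = [np.linalg.norm(ridge_solution(X, y, lam)) for lam in lambdas]

for lam, norm in zip(lambdas, norms):
    print(f"lambda={lam:8.1f}  ||theta||_2={norm:.4f}")
```

For Ridge, the norm of the solution decreases as $\lambda$ grows, although an individual coefficient does not have to shrink monotonically along the way.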

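Standardization Requirement: Always scale your features (e.g., Z-score normalization) before applying regularization. The penalty is applied to the magnitude of the parameters $\theta$, and that magnitude depends on each feature's units: a feature measured on a small scale needs a large coefficient and is penalized more heavily than an equally informative feature measured on a large scale.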
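A minimal Z-score standardization might look like the following. The function name `standardize`, the handling of constant columns, and the example matrix are assumptions for illustration.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract the column mean, divide by the column std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Leave constant columns unscaled instead of dividing by zero.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std, mean, std

# Two features on very different scales (hypothetical values).
X = np.array([[1.0, 1000.0],
              [2.0, 2000.0],
              [3.0, 3000.0]])

X_scaled, mean, std = standardize(X)
print(X_scaled.mean(axis=0))  # approximately 0 for each column
print(X_scaled.std(axis=0))   # approximately 1 for each column
```

In practice the mean and standard deviation are usually computed on the training data only and then reused to transform validation or test data, so that no information leaks from the held-out set.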