
Regularized Least Squares

When we have many features or limited data, the Ordinary Least Squares (OLS) estimator can result in very large parameter values, leading to overfitting. Regularization adds a penalty term to the error function to keep the parameters small.


1. The Regularized Objective

The general form of a regularized error function is:

$$E(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 + \lambda R(\theta)$$

Where:

  • λ is the regularization coefficient (controls the trade-off between fitting the data and keeping the parameters small).
  • R(θ) is the penalty term.
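The objective above is straightforward to express in NumPy. This is an illustrative sketch; the function and penalty names (`regularized_error`, `ridge_penalty`, `lasso_penalty`) are ours, not from any particular library:

```python
import numpy as np

def regularized_error(theta, X, y, lam, penalty):
    """E(theta) = 0.5 * sum of squared residuals + lambda * R(theta)."""
    residuals = y - X @ theta
    return 0.5 * np.sum(residuals ** 2) + lam * penalty(theta)

# Ridge penalty: (1/2) * ||theta||_2^2 ; Lasso penalty: ||theta||_1
ridge_penalty = lambda t: 0.5 * np.sum(t ** 2)
lasso_penalty = lambda t: np.sum(np.abs(t))
```

Swapping in a different `penalty` callable gives Ridge, Lasso, or Elastic Net without changing the data-fitting term.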

2. Ridge Regression (L2 Regularization)

Ridge regression uses a quadratic penalty:

$$R(\theta) = \frac{1}{2} \|\theta\|_2^2 = \frac{1}{2} \sum_{j=1}^{n} \theta_j^2$$

Key Features:

  • Shrinks coefficients towards zero but never exactly to zero.
  • Has a closed-form solution: $\theta_{\text{Ridge}} = (X^T X + \lambda I)^{-1} X^T y$.
  • Effectively handles multicollinearity (highly correlated features).
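The closed-form solution can be sketched in a few lines of NumPy (the function name `ridge_closed_form` is ours). Note that solving the linear system directly is numerically preferable to forming the explicit inverse:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lambda * I) theta = X^T y for theta."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)
```

With λ = 0 this reduces to the OLS normal equations; increasing λ shrinks the solution's norm.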

3. Lasso Regression (L1 Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) uses an absolute value penalty:

$$R(\theta) = \|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$$

Key Features:

  • Performs Feature Selection: It can force some coefficients to be exactly zero.
  • Produces "sparse" models.
  • Does not have a closed-form solution (requires numerical optimization).
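One standard numerical approach (among several) is coordinate descent with soft-thresholding, which is where the exact zeros come from. A minimal sketch, assuming the objective 0.5·‖y − Xθ‖² + λ‖θ‖₁; the function name is ours:

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5*||y - X theta||^2 + lam*||theta||_1, one coordinate at a time."""
    m, n = X.shape
    theta = np.zeros(n)
    col_sq = np.sum(X ** 2, axis=0)  # ||x_j||^2 for each column
    for _ in range(n_iters):
        for j in range(n):
            # Partial residual excluding feature j's current contribution
            residual = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ residual
            # Soft-thresholding: sets small coefficients exactly to zero
            theta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return theta
```

Whenever |ρ| ≤ λ the update clips the coefficient to exactly zero, which is why Lasso performs feature selection while Ridge only shrinks.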

4. Elastic Net

Elastic Net combines the L1 and L2 penalties:

$$R(\theta) = \alpha \|\theta\|_1 + (1 - \alpha) \frac{1}{2} \|\theta\|_2^2$$

It is useful when several features are correlated with one another: Lasso alone tends to select one of them arbitrarily, while Elastic Net spreads weight across the group.
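The combined penalty is a direct translation of the formula; α = 1 recovers Lasso and α = 0 recovers Ridge (the function name `elastic_net_penalty` is ours):

```python
import numpy as np

def elastic_net_penalty(theta, alpha):
    """R(theta) = alpha * ||theta||_1 + (1 - alpha) * 0.5 * ||theta||_2^2."""
    return alpha * np.sum(np.abs(theta)) + (1 - alpha) * 0.5 * np.sum(theta ** 2)
```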


Visualizing Shrinkage

As λ increases, the magnitude of the weights decreases.

Coefficient Shrinkage (Ridge/Lasso)

As the regularization penalty λ increases, the magnitude of the model's coefficients θ shrinks toward zero. This prevents the model from relying too heavily on any single feature, reducing overfitting.
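The shrinkage effect can be checked numerically with the Ridge closed form on synthetic data (the data and true weights below are illustrative, chosen only to make the trend visible):

```python
import numpy as np

# Synthetic regression problem: 50 samples, 5 features, known true weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_theta = np.array([3.0, -2.0, 1.0, 0.5, -1.0])
y = X @ true_theta + rng.normal(scale=0.1, size=50)

# Solve the Ridge normal equations for increasing lambda.
norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    theta = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)
    norms.append(np.linalg.norm(theta))
# The norm of theta decreases monotonically as lambda grows.
```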


Standardization Requirement: Always scale your features (e.g., Z-score normalization) before applying regularization. Because the penalty acts on the magnitude of the parameters θ, it depends on the units of each feature: a feature measured on a small numeric scale needs a large coefficient to have the same effect, so it is shrunk disproportionately.
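A minimal Z-score standardization sketch (the function name `standardize` is ours); the mean and standard deviation are returned so the same transform can be applied to new data:

```python
import numpy as np

def standardize(X):
    """Z-score normalization: zero mean and unit variance per feature."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / sigma, mu, sigma
```

In practice, fit `mu` and `sigma` on the training set only and reuse them at prediction time.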