Regularized Least Squares

When we have many features or limited data, the Ordinary Least Squares (OLS) estimator can result in very large parameter values, leading to overfitting. Regularization adds a penalty term to the error function to keep the parameters small.

1. Probabilistic Interpretation: MAP Estimation

OLS can be derived as the Maximum Likelihood Estimate (MLE) under Gaussian noise. Regularized regression has an equally clean probabilistic foundation in Maximum A Posteriori (MAP) estimation.

In MAP, we treat the parameters $\boldsymbol{\theta}$ as random variables with their own prior distribution $P(\boldsymbol{\theta})$.

The MAP Objective

Using Bayes' Rule, the posterior probability of the parameters given the data is: $P(\boldsymbol{\theta} \mid D) \propto P(D \mid \boldsymbol{\theta}) P(\boldsymbol{\theta})$

To find the optimal parameters, we maximize the log-posterior: $\hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \left[ \underbrace{\log P(D \mid \boldsymbol{\theta})}_{\text{Log-Likelihood}} + \underbrace{\log P(\boldsymbol{\theta})}_{\text{Log-Prior}} \right]$

Ridge ( $L_2$ ) and the Gaussian Prior

If we assume each weight $\theta_j$ independently follows a zero-mean Gaussian distribution $\mathcal{N}(0, \sigma^2)$, then: $\log P(\boldsymbol{\theta}) = -\frac{1}{2\sigma^2} \sum_j \theta_j^2 + \text{const}$. Maximizing the log-posterior therefore penalizes the sum of squared weights, which is exactly the Ridge Regression penalty.
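To make the link to the regularization coefficient explicit, here is a short derivation sketch. It assumes a linear model with Gaussian noise, $y^{(i)} = \boldsymbol{\theta}^T x^{(i)} + \epsilon^{(i)}$ with $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma_\epsilon^2)$. The noise variance $\sigma_\epsilon^2$ is a symbol introduced here for illustration and does not appear elsewhere in these notes.

$$
\begin{aligned}
-\log P(\boldsymbol{\theta} \mid D) &= \frac{1}{2\sigma_\epsilon^2} \sum_{i=1}^m \left(y^{(i)} - \boldsymbol{\theta}^T x^{(i)}\right)^2 + \frac{1}{2\sigma^2} \sum_{j=1}^n \theta_j^2 + \text{const} \\
\sigma_\epsilon^2 \left(-\log P(\boldsymbol{\theta} \mid D)\right) &= \frac{1}{2} \sum_{i=1}^m \left(y^{(i)} - \boldsymbol{\theta}^T x^{(i)}\right)^2 + \frac{\lambda}{2} \sum_{j=1}^n \theta_j^2 + \text{const}, \qquad \lambda = \frac{\sigma_\epsilon^2}{\sigma^2}
\end{aligned}
$$

Under these assumptions, a narrower prior (smaller $\sigma^2$) corresponds to a larger $\lambda$, that is, to stronger shrinkage.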

Lasso ( $L_1$ ) and the Laplacian Prior

If we assume each weight independently follows a zero-mean Laplace distribution with scale $b$, then: $\log P(\boldsymbol{\theta}) = -\frac{1}{b} \sum_j |\theta_j| + \text{const}$. This corresponds to Lasso Regression, which tends to produce sparse solutions (weights set exactly to zero).

Estimator	Treatment of $\theta$	Corresponds to
MLE	Fixed parameter	Ordinary Least Squares
MAP	Random variable	Regularized Regression

2. The Regularized Objective

The general form of a regularized error function is:

$$E(\theta) = \frac{1}{2} \sum_{i=1}^m (y^{(i)} - \theta^T x^{(i)})^2 + \lambda R(\theta)$$

Where:

$\lambda$ is the regularization coefficient (controls the trade-off).
$R(\theta)$ is the penalty term.
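As a minimal sketch, the objective can be written as a function that takes the penalty as an argument. The function names, the toy data, and the two example penalties (a halved squared norm and a sum of absolute values) are illustrative choices, not part of the original notes.

```python
import numpy as np

def regularized_error(theta, X, y, lam, penalty):
    """E(theta) = 1/2 * sum of squared residuals + lam * R(theta)."""
    residuals = y - X @ theta
    return 0.5 * np.sum(residuals ** 2) + lam * penalty(theta)

def l2_penalty(theta):
    # Quadratic penalty: 1/2 * sum of squared weights
    return 0.5 * np.sum(theta ** 2)

def l1_penalty(theta):
    # Absolute-value penalty: sum of |weights|
    return np.sum(np.abs(theta))

# Toy data chosen so that theta fits y exactly (all residuals are zero)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
theta = np.array([1.0, 2.0])

e_l2 = regularized_error(theta, X, y, lam=0.1, penalty=l2_penalty)
e_l1 = regularized_error(theta, X, y, lam=0.1, penalty=l1_penalty)
print(e_l2, e_l1)
```

Because the residuals are zero in this toy example, the printed values consist only of the penalty contribution $\lambda R(\theta)$.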

3. Ridge Regression ( $L_2$ Regularization)

Ridge regression uses a quadratic penalty:

$$R(\theta) = \frac{1}{2} \|\theta\|^2_2 = \frac{1}{2} \sum_{j=1}^n \theta_j^2$$

Key Features:

Shrinks coefficients towards zero but never exactly to zero.
Has a closed-form solution: $\theta_{Ridge} = (X^T X + \lambda I)^{-1} X^T y$.
Effectively handles multicollinearity (highly correlated features): for $\lambda > 0$ the matrix $X^T X + \lambda I$ is always invertible, even when $X^T X$ is not.
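The closed-form solution can be sketched in a few lines of NumPy. The function name `ridge_fit` and the synthetic data (with two nearly identical columns to mimic multicollinearity) are assumptions made for illustration. The sketch solves the linear system rather than forming the inverse explicitly, which is the usual numerically safer choice.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) theta = X^T y for theta."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# Make the third column almost an exact copy of the first (multicollinearity)
X[:, 2] = X[:, 0] + 1e-6 * rng.normal(size=50)
y = X[:, 0] + 0.1 * rng.normal(size=50)

theta_ridge = ridge_fit(X, y, lam=1.0)
print(theta_ridge)
```

In this setup the penalty spreads the weight across the two near-duplicate columns instead of letting their coefficients grow large with opposite signs.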

4. Lasso Regression ( $L_1$ Regularization)

Lasso (Least Absolute Shrinkage and Selection Operator) uses an absolute value penalty:

$$R(\theta) = \|\theta\|_1 = \sum_{j=1}^n |\theta_j|$$

Key Features:

Performs Feature Selection: It can force some coefficients to be exactly zero.
Produces "sparse" models.
Has no closed-form solution in general (requires numerical optimization, e.g. coordinate descent or proximal gradient methods).
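One possible numerical approach is proximal gradient descent (often called ISTA), sketched below. This is an illustrative choice rather than the method these notes prescribe. The function names, iteration count, and synthetic data are assumptions. The soft-thresholding step is what sets coefficients exactly to zero.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z towards zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 1/2 * ||y - X theta||^2 + lam * ||theta||_1."""
    # Step size 1/L, where L is the largest eigenvalue of X^T X
    L = np.linalg.norm(X, ord=2) ** 2
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)  # gradient of the squared-error term
        theta = soft_threshold(theta - grad / L, lam / L)
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_theta = np.array([3.0, 0.0, 0.0, -2.0, 0.0])  # only two active features
y = X @ true_theta + 0.1 * rng.normal(size=100)

theta_lasso = lasso_ista(X, y, lam=5.0)
print(theta_lasso)

# Once lam reaches max |X^T y|, the all-zero vector is optimal
lam_max = np.max(np.abs(X.T @ y))
theta_zero = lasso_ista(X, y, lam=lam_max)
```

Library implementations typically use coordinate descent and may scale the squared-error term differently, so their regularization parameter is not directly comparable to `lam` here.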

5. Elastic Net

Elastic Net combines both $L_1$ and $L_2$ penalties:

$$R(\theta) = \alpha \|\theta\|_1 + (1-\alpha) \frac{1}{2} \|\theta\|^2_2$$

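Here $\alpha \in [0, 1]$ sets the mix: $\alpha = 1$ recovers the Lasso penalty and $\alpha = 0$ recovers the Ridge penalty. Elastic Net is useful when several features are correlated with each other: Lasso alone tends to keep one feature from a correlated group and drop the rest, while the $L_2$ term encourages correlated features to share the weight.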
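The penalty itself is simple to write down. The sketch below uses a hypothetical helper `elastic_net_penalty` and a toy weight vector to show how the mixing parameter interpolates between the two penalties.

```python
import numpy as np

def elastic_net_penalty(theta, alpha):
    """R(theta) = alpha * ||theta||_1 + (1 - alpha) * 1/2 * ||theta||_2^2."""
    l1 = np.sum(np.abs(theta))
    l2 = 0.5 * np.sum(theta ** 2)
    return alpha * l1 + (1.0 - alpha) * l2

theta = np.array([1.0, -2.0, 0.0])
for alpha in (0.0, 0.5, 1.0):
    print(alpha, elastic_net_penalty(theta, alpha))
```

Note that libraries often parameterize Elastic Net differently (for example with a separate overall strength and mixing ratio), so parameter values do not transfer directly from this formula.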

Visualizing Shrinkage

Figure: Coefficient Shrinkage (Ridge/Lasso)

As the regularization coefficient $\lambda$ increases, the magnitudes of the model's coefficients $\theta_j$ shrink towards zero. This prevents the model from relying too heavily on any single feature, which reduces overfitting.
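The shrinkage effect can be checked numerically. The sketch below fits Ridge on synthetic data (an assumption made for illustration, as are the variable names and the chosen grid of $\lambda$ values) and prints the norm of the coefficient vector for each $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=40)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lambdas:
    # Ridge closed form: solve (X^T X + lam * I) theta = X^T y
    theta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    norms.append(np.linalg.norm(theta))
    print(f"lambda={lam:8.1f}  ||theta||_2={norms[-1]:.4f}")
```

For Ridge the norm of the solution never increases as $\lambda$ grows, so the printed norms form a non-increasing sequence. Individual coefficients can still move non-monotonically along the way.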


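Standardization Requirement: Always scale your features (e.g., Z-score normalization) before applying regularization. The penalty acts on the magnitude of the parameters $\theta$, and that magnitude depends on each feature's units: a feature with a small numeric range needs a large coefficient to have the same effect on the prediction, so it is penalized more heavily than a feature with a large numeric range.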
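A minimal Z-score sketch in NumPy follows. The helper name `standardize` and the toy matrix are illustrative. The function returns the mean and standard deviation so that the same statistics, computed on the training data, can be reused on new data.

```python
import numpy as np

def standardize(X):
    """Z-score each column: subtract its mean, divide by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)  # assumes no column is constant (std > 0)
    return (X - mean) / std, mean, std

# Two features on very different scales
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 2000.0]])
X_scaled, mean, std = standardize(X)

# New data is scaled with the training statistics, not its own
X_new = np.array([[2.0, 2500.0]])
X_new_scaled = (X_new - mean) / std
```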
