Regularized Least Squares
When we have many features or limited data, the Ordinary Least Squares (OLS) estimator can result in very large parameter values, leading to overfitting. Regularization adds a penalty term to the error function to keep the parameters small.
1. Probabilistic Interpretation: MAP Estimation
While Ordinary Least Squares (OLS) is derived from Maximum Likelihood Estimation (MLE), regularized regression has a beautiful probabilistic foundation in Maximum A Posteriori (MAP) estimation.
In MAP, we treat the parameters $\mathbf{w}$ as random variables with their own underlying prior distribution $p(\mathbf{w})$.
The MAP Objective
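Using Bayes' Rule, the posterior probability of the parameters $\mathbf{w}$ given the data $\mathcal{D}$ is:
$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$$
To find the optimal parameters, we maximize the log-posterior. The evidence $p(\mathcal{D})$ does not depend on $\mathbf{w}$, so it drops out of the maximization:
$$\mathbf{w}_{\text{MAP}} = \arg\max_{\mathbf{w}} \left[ \ln p(\mathcal{D} \mid \mathbf{w}) + \ln p(\mathbf{w}) \right]$$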
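Ridge ($L_2$) and the Gaussian Prior
If we assume our weights follow a zero-mean Gaussian distribution, $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_w^2 \mathbf{I})$, then:
$$\ln p(\mathbf{w}) = -\frac{1}{2\sigma_w^2} \|\mathbf{w}\|_2^2 + \text{const}$$
Maximizing this term is equivalent to minimizing the sum of squared weights, which is exactly the Ridge Regression penalty.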
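The link between the prior and the regularization coefficient can be made explicit with a short derivation. The sketch below assumes Gaussian observation noise with variance $\sigma^2$ and a zero-mean Gaussian prior with variance $\sigma_w^2$; both symbols, and the notation $\mathbf{x}_n$, $y_n$ for the $n$-th input and target, are assumptions introduced here for illustration.

```latex
\begin{aligned}
\ln p(\mathbf{w} \mid \mathcal{D})
  &= \ln p(\mathcal{D} \mid \mathbf{w}) + \ln p(\mathbf{w}) + \text{const} \\
  &= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} \left( y_n - \mathbf{w}^T \mathbf{x}_n \right)^2
     - \frac{1}{2\sigma_w^2} \|\mathbf{w}\|_2^2 + \text{const}
\end{aligned}

% Multiplying by -\sigma^2 turns the maximization into a minimization:
\mathbf{w}_{\text{MAP}}
  = \arg\min_{\mathbf{w}} \left[
      \frac{1}{2} \sum_{n=1}^{N} \left( y_n - \mathbf{w}^T \mathbf{x}_n \right)^2
      + \frac{\lambda}{2} \|\mathbf{w}\|_2^2
    \right],
  \qquad \lambda = \frac{\sigma^2}{\sigma_w^2}
```

Under these assumptions, a narrower prior (smaller $\sigma_w^2$) or noisier data (larger $\sigma^2$) corresponds to a larger $\lambda$, that is, stronger shrinkage.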
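Lasso ($L_1$) and the Laplacian Prior
If we assume each weight independently follows a zero-mean Laplacian distribution, $p(w_j) = \frac{1}{2b} \exp\left(-\frac{|w_j|}{b}\right)$, then:
$$\ln p(\mathbf{w}) = -\frac{1}{b} \sum_j |w_j| + \text{const}$$
Maximizing this term is equivalent to minimizing the sum of absolute weights. This corresponds to Lasso Regression, which tends to produce sparse solutions (weights set exactly to zero).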
| Estimation | Treatment of parameters | Corresponds to |
|---|---|---|
| MLE | Fixed Parameter | Ordinary Least Squares |
| MAP | Random Variable | Regularized Regression |
2. The Regularized Objective
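The general form of a regularized error function adds a penalty to the data-fit error $E_D(\mathbf{w})$:
$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
Where:
- $\lambda \ge 0$ is the regularization coefficient (controls the trade-off between fitting the data and keeping the weights small).
- $E_W(\mathbf{w})$ is the penalty term.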
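3. Ridge Regression ($L_2$ Regularization)
Ridge regression uses a quadratic penalty. With $\mathbf{x}_n$ and $y_n$ denoting the $n$-th input and target:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y_n - \mathbf{w}^T \mathbf{x}_n \right)^2 + \frac{\lambda}{2} \|\mathbf{w}\|_2^2$$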
Key Features:
- Shrinks coefficients towards zero but never exactly to zero.
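- Has a closed-form solution: $\mathbf{w} = (\mathbf{X}^T \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$, where $\mathbf{X}$ is the design matrix and $\mathbf{y}$ is the target vector.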
- Effectively handles multicollinearity (highly correlated features).
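The closed-form solution can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic data, not a reference implementation: the function name `ridge_fit`, the data, and the choice `lam=10.0` are all assumptions made for the example, and no intercept term is fitted.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge weights w = (X^T X + lam * I)^{-1} X^T y (hypothetical helper)."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(A, X.T @ y)

# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

w_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 recovers ordinary least squares
w_ridge = ridge_fit(X, y, lam=10.0)  # lam > 0 shrinks the weights

print("OLS weights:  ", w_ols)
print("Ridge weights:", w_ridge)
```

Adding $\lambda \mathbf{I}$ also makes the matrix being solved strictly positive definite for any $\lambda > 0$, which is one way to see why Ridge remains well behaved when features are highly correlated.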
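4. Lasso Regression ($L_1$ Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator) uses an absolute value penalty. With $\mathbf{x}_n$ and $y_n$ denoting the $n$-th input and target:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y_n - \mathbf{w}^T \mathbf{x}_n \right)^2 + \lambda \sum_j |w_j|$$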
Key Features:
- Performs Feature Selection: It can force some coefficients to be exactly zero.
- Produces "sparse" models.
- Does not have a closed-form solution (requires numerical optimization).
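One common numerical approach is coordinate descent with soft-thresholding; the notes above do not prescribe a specific solver, so this is a swapped-in choice for illustration. The sketch below minimizes $\frac{1}{2}\|\mathbf{y} - \mathbf{X}\mathbf{w}\|_2^2 + \lambda \|\mathbf{w}\|_1$ on synthetic data; the function names, the data, and `lam=20.0` are assumptions for the example.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=200):
    """Minimize 0.5 * ||y - Xw||^2 + lam * ||w||_1 (hypothetical helper)."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    col_sq = (X ** 2).sum(axis=0)  # squared norm of each feature column
    for _ in range(n_iters):
        for j in range(n_features):
            # Residual with feature j's current contribution removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual
            # Exact minimizer of the objective along coordinate j.
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Synthetic data: only features 0 and 3 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_w + 0.1 * rng.normal(size=100)

w_lasso = lasso_coordinate_descent(X, y, lam=20.0)
print("Lasso weights:", w_lasso)
```

The soft-thresholding step is where sparsity comes from: whenever a feature's correlation with the residual falls below $\lambda$ in magnitude, its weight is set to exactly zero rather than merely made small.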
5. Elastic Net
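Elastic Net combines both $L_1$ and $L_2$ penalties, each with its own coefficient:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y_n - \mathbf{w}^T \mathbf{x}_n \right)^2 + \lambda_1 \sum_j |w_j| + \frac{\lambda_2}{2} \|\mathbf{w}\|_2^2$$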
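It is useful when several features are correlated with each other: Lasso alone tends to keep one feature from a correlated group and drop the rest, while the added $L_2$ term tends to shrink correlated features together.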
Visualizing Shrinkage
As $\lambda$ increases, the magnitude of the weights decreases.
Coefficient Shrinkage (Ridge/Lasso)
As the regularization coefficient $\lambda$ increases, the magnitude of the model's coefficients $\mathbf{w}$ shrinks towards zero. This prevents the model from relying too heavily on any single feature, thus reducing overfitting.
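The shrinkage effect can be checked numerically by fitting Ridge for a range of coefficients and printing the norm of the resulting weights. This is a small sketch on synthetic data; the data, the grid `lambdas`, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic regression problem, made up for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.7]) + 0.2 * rng.normal(size=60)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]
norms = []
for lam in lambdas:
    # Closed-form Ridge solution for this value of the coefficient.
    w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    norms.append(np.linalg.norm(w))
    print(f"lambda={lam:8.1f}  ||w||_2={norms[-1]:.4f}")
```

Plotting each individual coefficient against $\lambda$ on a logarithmic axis gives the usual coefficient-path picture: for Ridge the curves decay smoothly, whereas Lasso paths hit zero at finite values of $\lambda$.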
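Standardization Requirement: Always scale your features (e.g., Z-score normalization) before applying regularization. The penalty is applied to the magnitude of the parameters $\mathbf{w}$, and the size of each coefficient depends on the units of its feature, so unscaled features are penalized unequally: a feature with a small numeric range needs a large coefficient and is penalized more heavily than a comparable feature with a large range.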
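Z-score normalization can be sketched as follows. The helper name `standardize` and the toy arrays are assumptions for illustration; the point worth noting is that the mean and standard deviation are computed on the training data only and then reused for any new data.

```python
import numpy as np

def standardize(X_train, X_new):
    """Z-score both arrays using statistics from X_train (hypothetical helper)."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    std = np.where(std == 0, 1.0, std)
    return (X_train - mean) / std, (X_new - mean) / std

# Two features on very different scales (toy values).
X_train = np.array([[1.0, 1000.0],
                    [2.0, 3000.0],
                    [3.0, 5000.0]])
X_test = np.array([[2.0, 3000.0]])

X_train_std, X_test_std = standardize(X_train, X_test)
print(X_train_std)
print(X_test_std)
```

After this transformation every feature has zero mean and unit variance on the training set, so the penalty treats all coefficients on a comparable footing.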