Regularized Least Squares
When we have many features or limited data, the Ordinary Least Squares (OLS) estimator can result in very large parameter values, leading to overfitting. Regularization adds a penalty term to the error function to keep the parameters small.
1. The Regularized Objective
The general form of a regularized error function is:

$$E(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$

Where:
- $\lambda$ is the regularization coefficient (controls the trade-off between fitting the data and keeping the weights small).
- $E_W(\mathbf{w})$ is the penalty term on the parameters.
2. Ridge Regression ($L_2$ Regularization)
Ridge regression uses a quadratic penalty:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\big)^2 + \frac{\lambda}{2}\|\mathbf{w}\|_2^2$$
Key Features:
- Shrinks coefficients towards zero but never exactly to zero.
- Has a closed-form solution: $\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^\top \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^\top \mathbf{t}$.
- Effectively handles multicollinearity (highly correlated features).
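The closed-form solution above can be sketched directly in NumPy. This is a minimal toy example (the data, shapes, and $\lambda$ value are illustrative assumptions, not from the text); it uses the identity-feature case $\boldsymbol{\Phi} = \mathbf{X}$ for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 50 samples, 5 features, with one near-duplicate column
# to illustrate the multicollinearity that ridge handles well.
X = rng.normal(size=(50, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=50)
t = X @ np.array([1.0, 1.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=50)

lam = 1.0

# Closed-form ridge solution: w = (lambda*I + X^T X)^{-1} X^T t.
# Solving the linear system is more stable than forming the inverse.
w_ridge = np.linalg.solve(lam * np.eye(X.shape[1]) + X.T @ X, X.T @ t)

print(np.round(w_ridge, 3))
```

Note that adding $\lambda \mathbf{I}$ makes the matrix well-conditioned even when columns of $\mathbf{X}$ are nearly collinear, which is why the system solves reliably here while plain OLS would be unstable.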
3. Lasso Regression ($L_1$ Regularization)
Lasso (Least Absolute Shrinkage and Selection Operator) uses an absolute-value penalty:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\big)^2 + \lambda \sum_{j} |w_j|$$
Key Features:
- Performs Feature Selection: It can force some coefficients to be exactly zero.
- Produces "sparse" models.
- Does not have a closed-form solution (requires numerical optimization).
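Since no closed form exists, lasso is typically fit numerically. The sketch below uses cyclic coordinate descent with the soft-thresholding operator, one common choice (the data and $\lambda$ are illustrative assumptions); it shows some coefficients being driven exactly to zero.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: the proximal map of the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, t, lam, n_iter=200):
    """Lasso via cyclic coordinate descent (a common numerical scheme)."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            # Partial residual with feature j's contribution removed
            r_j = t - X @ w + X[:, j] * w[j]
            w[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
true_w = np.array([2.0, 0.0, 0.0, -1.5, 0.0, 0.0])
t = X @ true_w + 0.05 * rng.normal(size=100)

w = lasso_cd(X, t, lam=20.0)
print(np.round(w, 3))  # a sparse weight vector: irrelevant features hit zero
```

The per-coordinate update minimizes the objective exactly in one weight at a time; whenever the correlation $|x_j^\top r_j|$ falls below $\lambda$, the soft-threshold sets $w_j$ to exactly zero, which is the feature-selection behaviour described above.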
4. Elastic Net
Elastic Net combines both the $L_1$ and $L_2$ penalties:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n)\big)^2 + \lambda_1 \|\mathbf{w}\|_1 + \frac{\lambda_2}{2}\|\mathbf{w}\|_2^2$$

It is useful when several features are correlated with one another: the $L_2$ part encourages correlated features to share weight, while the $L_1$ part still yields sparsity.
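Elastic Net can be fit with the same coordinate-descent scheme as lasso: the $L_1$ part contributes the soft-threshold and the $L_2$ part simply inflates the denominator. This is a sketch under assumed toy data and penalty values, not sklearn's exact parameterization.

```python
import numpy as np

def elastic_net_cd(X, t, lam1, lam2, n_iter=200):
    """Elastic Net via coordinate descent: soft-threshold for the L1 term,
    plus lam2 added to the denominator for the L2 term."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(d):
            r_j = t - X @ w + X[:, j] * w[j]
            z = X[:, j] @ r_j
            w[j] = np.sign(z) * max(abs(z) - lam1, 0.0) / (col_sq[j] + lam2)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(80, 4))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=80)   # strongly correlated pair
t = X @ np.array([1.0, 1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=80)

w = elastic_net_cd(X, t, lam1=5.0, lam2=50.0)
print(np.round(w, 3))
```

With a pure lasso, one of the two correlated columns would tend to absorb all the weight; the $L_2$ term here pushes the pair towards nearly equal coefficients, which is the behaviour that motivates Elastic Net for correlated features.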
Visualizing Shrinkage
As the regularization coefficient $\lambda$ increases, the magnitude of the model's coefficients $\mathbf{w}$ shrinks towards zero. This prevents the model from relying too heavily on any single feature, thus reducing overfitting.

[Figure: Coefficient Shrinkage (Ridge/Lasso)]
Standardization Requirement: Always scale your features (e.g., Z-score normalization) before applying regularization. Since the penalty depends on the magnitudes of the parameters $w_j$, features measured on different scales are penalized unevenly: a feature on a small scale needs a large weight and is therefore shrunk more than it should be.
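Z-score normalization is a one-liner in NumPy. A minimal sketch with assumed toy data, where the two features deliberately differ in scale by three orders of magnitude:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two features on very different scales (e.g. metres vs millimetres)
X = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])

# Z-score normalization: subtract the column mean, divide by the column std
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # each column now has (near-)zero mean
print(X_std.std(axis=0))   # each column now has unit variance
```

After this transform, the penalty $\lambda \|\mathbf{w}\|^2$ treats every feature on an equal footing. In practice, the scaling statistics must be computed on the training set only and reused on test data.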