Bayesian Linear Regression
Standard linear regression provides a single "best-fit" line (point estimates for parameters). Bayesian Linear Regression treats the parameters as random variables, allowing us to quantify our uncertainty about the model's parameters and its predictions.
1. The Bayesian Framework
Instead of finding a single best-fit weight vector $\mathbf{w}$, we compute the Posterior Distribution over parameters using Bayes' Theorem:

$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}$$
Where:
- $p(\mathbf{w})$ is the Prior: our belief about the parameters before seeing any data. Usually, we assume a zero-mean Gaussian prior: $p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I})$, where $\alpha$ is the prior precision.
- $p(\mathcal{D} \mid \mathbf{w})$ is the Likelihood: how well the parameters explain the observed data.
- $p(\mathbf{w} \mid \mathcal{D})$ is the Posterior: our updated belief after seeing the data.
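With a Gaussian prior and Gaussian observation noise of precision $\beta$, the posterior is itself Gaussian, $p(\mathbf{w} \mid \mathcal{D}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$, with $\mathbf{S}_N = (\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}$ and $\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^\top\mathbf{y}$. A minimal NumPy sketch; the synthetic dataset and the values of $\alpha$ and $\beta$ are illustrative assumptions, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: y = 0.5 x + 1.0 + Gaussian noise
X = np.linspace(1, 5, 20)
y = 0.5 * X + 1.0 + rng.normal(scale=0.3, size=X.shape)

alpha = 2.0           # prior precision (assumed)
beta = 1.0 / 0.3**2   # noise precision, 1 / sigma^2 (assumed)

# Design matrix with a bias column: phi(x) = [1, x]
Phi = np.column_stack([np.ones_like(X), X])

# Closed-form Gaussian posterior: p(w | D) = N(w | m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

print("posterior mean:", m_N)  # should land near the true weights [1.0, 0.5]
```

Note that the posterior is computed in one linear-algebra step; no iterative optimization is needed for this conjugate Gaussian model.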
2. Predictive Distribution
The real power of Bayesian regression lies in the Predictive Distribution. For a new input $x_*$, we don't just get a single value $\hat{y}_*$; we get a distribution, obtained by averaging the model's predictions over the posterior:

$$p(y_* \mid x_*, \mathcal{D}) = \int p(y_* \mid x_*, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}$$

This distribution is also Gaussian:

$$p(y_* \mid x_*, \mathcal{D}) = \mathcal{N}\!\left(y_* \,\middle|\, \mathbf{m}_N^\top \boldsymbol{\phi}(x_*),\; \frac{1}{\beta} + \boldsymbol{\phi}(x_*)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_*)\right)$$

where $\mathbf{m}_N$ and $\mathbf{S}_N$ are the posterior mean and covariance, $\boldsymbol{\phi}(x_*)$ is the feature vector of the new input, and $\beta$ is the noise precision. The variance has two parts: irreducible observation noise ($1/\beta$) and uncertainty about the parameters themselves.
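The predictive mean and variance follow directly from the posterior. A self-contained sketch, using an assumed synthetic dataset and illustrative precisions $\alpha$, $\beta$ (none of these values come from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 5, 20)
y = 0.5 * X + 1.0 + rng.normal(scale=0.3, size=X.shape)

alpha, beta = 2.0, 1.0 / 0.3**2          # assumed precisions
Phi = np.column_stack([np.ones_like(X), X])
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

def predict(x_star):
    """Gaussian predictive distribution at a new input x_*."""
    phi = np.array([1.0, x_star])
    mu = m_N @ phi                        # m_N^T phi(x_*)
    var = 1.0 / beta + phi @ S_N @ phi    # noise + parameter uncertainty
    return mu, var

mu, var = predict(3.0)
print(f"y(3.0) ~ N({mu:.2f}, {var:.3f})")
```

Each prediction is a full Gaussian, so error bars (e.g. $\mu \pm 2\sigma$) come for free.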
3. Visualizing Uncertainty
A major advantage of the Bayesian approach is that the model "knows what it doesn't know." The predictive variance is small in regions where we have lots of training data and grows large in regions where data is sparse.
*Figure: Bayesian Predictive Uncertainty.* Unlike OLS, which gives a single point estimate, Bayesian regression provides a full predictive distribution. Notice how uncertainty (shaded area) increases as we move away from the training data (x = 1 to x = 5).
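This behavior is easy to check numerically: the predictive variance at a point inside the training range (x = 3) is smaller than at a point far outside it (x = 10). A sketch under an assumed synthetic dataset and illustrative precisions:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(1, 5, 20)                  # training inputs span [1, 5]
y = 0.5 * X + 1.0 + rng.normal(scale=0.3, size=X.shape)

alpha, beta = 2.0, 1.0 / 0.3**2            # assumed precisions
Phi = np.column_stack([np.ones_like(X), X])
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)

def predictive_var(x_star):
    phi = np.array([1.0, x_star])
    return 1.0 / beta + phi @ S_N @ phi

var_in = predictive_var(3.0)    # inside the training range
var_out = predictive_var(10.0)  # far outside it
print(f"var(x=3) = {var_in:.3f}, var(x=10) = {var_out:.3f}")
```

The variance grows because $\boldsymbol{\phi}(x_*)^\top \mathbf{S}_N\, \boldsymbol{\phi}(x_*)$ increases as $x_*$ leaves the region constrained by the data.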
Comparison: OLS vs. Bayesian
| Feature | Ordinary Least Squares (OLS) | Bayesian Linear Regression |
|---|---|---|
| Output | Single line (Point estimates) | Distribution of lines (Posterior) |
| Prior Knowledge | Not explicitly used | Encoded via the Prior |
| Uncertainty | Not directly provided | Quantified via Predictive Variance |
| Overfitting | Prone if no regularization | Naturally regularized by the Prior |
Regularization Connection: It can be shown that finding the MAP (Maximum A Posteriori) estimate with a zero-mean Gaussian prior is mathematically equivalent to Ridge Regression. The regularization parameter is $\lambda = \alpha / \beta$, i.e. the ratio of the noise variance ($1/\beta$) to the prior variance ($1/\alpha$).
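This equivalence can be verified numerically: the posterior mean $\mathbf{m}_N = \beta(\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}$ coincides with the ridge solution $(\lambda\mathbf{I} + \boldsymbol{\Phi}^\top\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}$ when $\lambda = \alpha/\beta$. A sketch with an assumed synthetic dataset:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.linspace(1, 5, 20)
y = 0.5 * X + 1.0 + rng.normal(scale=0.3, size=X.shape)
Phi = np.column_stack([np.ones_like(X), X])

alpha, beta = 2.0, 1.0 / 0.3**2  # assumed prior / noise precisions

# MAP estimate = posterior mean (the posterior is Gaussian)
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
w_map = beta * S_N @ Phi.T @ y

# Ridge solution with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(2) + Phi.T @ Phi, Phi.T @ y)

print(np.allclose(w_map, w_ridge))  # True
```

Dividing the MAP normal equations through by $\beta$ turns $\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ into $(\alpha/\beta)\mathbf{I} + \boldsymbol{\Phi}^\top\boldsymbol{\Phi}$, which is exactly the ridge system.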