
Bayesian Linear Regression

Standard linear regression provides a single "best-fit" line (point estimates for parameters). Bayesian Linear Regression treats the parameters as random variables, allowing us to quantify our uncertainty about the model's parameters and its predictions.


1. The Bayesian Framework

Instead of finding a single $\theta$, we compute the Posterior Distribution over parameters using Bayes' Theorem:

$$p(\theta \mid y, X) = \frac{p(y \mid X, \theta)\, p(\theta)}{p(y \mid X)}$$

Where:

  • $p(\theta)$ is the Prior: Our belief about the parameters before seeing any data. Usually, we assume a zero-mean Gaussian prior: $p(\theta) = \mathcal{N}(\theta \mid 0, \alpha^{-1} I)$.
  • $p(y \mid X, \theta)$ is the Likelihood: How well the parameters explain the observed data.
  • $p(\theta \mid y, X)$ is the Posterior: Our updated belief after seeing the data.
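With a Gaussian likelihood and the zero-mean Gaussian prior above, the posterior has a closed form: $S_N = (\alpha I + \beta \Phi^\top \Phi)^{-1}$ and $m_N = \beta S_N \Phi^\top y$, where $\beta$ is the (assumed known) noise precision. A minimal NumPy sketch on a synthetic 1-D dataset — the data, $\alpha$, and $\beta$ values here are illustrative assumptions, not from the text:

```python
import numpy as np

# Synthetic data: y = 2x + Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=20)
y = 2.0 * X + rng.normal(0, 0.5, size=20)

# Design matrix with a bias column
Phi = np.column_stack([np.ones_like(X), X])

alpha = 1.0  # prior precision: prior is N(0, alpha^-1 I)
beta = 4.0   # noise precision 1/sigma^2, assumed known

# Posterior over theta is Gaussian N(m_N, S_N):
#   S_N^-1 = alpha * I + beta * Phi^T Phi
#   m_N    = beta * S_N Phi^T y
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

print(m_N)  # posterior mean; slope should land near the true value 2
```

The posterior mean $m_N$ is also the MAP estimate here, since the posterior is Gaussian.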

2. Predictive Distribution

The real power of Bayesian regression lies in the Predictive Distribution. For a new input $x^*$, we don't just get a single value $y^*$; we get a distribution:

$$p(y^* \mid x^*, y, X) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid y, X)\, d\theta$$

This distribution is also Gaussian:

$$p(y^* \mid x^*, y, X) = \mathcal{N}(y^* \mid \mu^*, (\sigma^*)^2)$$
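For the conjugate Gaussian model, the integral is analytic: $\mu^* = m_N^\top \phi(x^*)$ and $(\sigma^*)^2 = 1/\beta + \phi(x^*)^\top S_N\, \phi(x^*)$, i.e. noise variance plus parameter uncertainty. A self-contained sketch (same synthetic setup and assumed $\alpha$, $\beta$ as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=20)
y = 2.0 * X + rng.normal(0, 0.5, size=20)
Phi = np.column_stack([np.ones_like(X), X])

alpha, beta = 1.0, 4.0
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

def predict(x_star):
    """Predictive mean and variance at a new input x_star."""
    phi = np.array([1.0, x_star])
    mu = m_N @ phi
    var = 1.0 / beta + phi @ S_N @ phi  # noise + parameter uncertainty
    return mu, var

mu, var = predict(3.0)
print(mu, var)  # mean near 2 * 3 = 6; variance strictly above 1/beta
```

Note that $(\sigma^*)^2 \ge 1/\beta$ always: even with infinite data, the irreducible noise variance remains.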

3. Visualizing Uncertainty

A major advantage of the Bayesian approach is that the model "knows what it doesn't know." The predictive variance $(\sigma^*)^2$ is small in regions where we have lots of training data and grows large in regions where data is sparse.

[Figure: Bayesian Predictive Uncertainty]

Unlike OLS, which gives a single point estimate, Bayesian regression provides a full predictive distribution. Notice how uncertainty (shaded area) increases as we move away from the training data ($x = 1$ to $x = 5$).
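This behavior is easy to check numerically: compare the predictive standard deviation inside the training range with a point far outside it. A sketch using a synthetic dataset on the same $[1, 5]$ range (the data and precision values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=20)  # training inputs live in [1, 5]
y = 2.0 * X + rng.normal(0, 0.5, size=20)
Phi = np.column_stack([np.ones_like(X), X])

alpha, beta = 1.0, 4.0
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)

def predictive_std(x_star):
    phi = np.array([1.0, x_star])
    return np.sqrt(1.0 / beta + phi @ S_N @ phi)

# Inside the data region vs. far outside it
std_near, std_far = predictive_std(3.0), predictive_std(10.0)
print(std_near, std_far)  # the model is less certain far from the data
```

The $\phi(x^*)^\top S_N\, \phi(x^*)$ term grows as $x^*$ moves away from the bulk of the training inputs, which is exactly the widening shaded band in the figure.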


Comparison: OLS vs. Bayesian

| Feature | Ordinary Least Squares (OLS) | Bayesian Linear Regression |
| --- | --- | --- |
| Output | Single line (point estimates) | Distribution of lines (posterior) |
| Prior Knowledge | Not explicitly used | Encoded via the prior $p(\theta)$ |
| Uncertainty | Not directly provided | Quantified via the predictive variance |
| Overfitting | Prone if no regularization | Naturally regularized by the prior |

Regularization Connection: It can be shown that finding the MAP (Maximum A Posteriori) estimate with a Gaussian prior is mathematically equivalent to Ridge Regression. The regularization parameter $\lambda$ is effectively the ratio of the noise variance to the prior variance.
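This equivalence can be verified directly: with noise variance $\sigma^2$ and prior precision $\alpha$, the MAP estimate equals the ridge solution with $\lambda = \sigma^2 \alpha$ (noise variance divided by prior variance $\alpha^{-1}$). A sketch on synthetic data (all values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
theta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ theta_true + rng.normal(0, sigma, size=30)

alpha = 2.0             # prior precision
lam = sigma**2 * alpha  # equivalent ridge parameter: noise var / prior var

# MAP estimate (= posterior mean for Gaussian likelihood + Gaussian prior)
beta = 1.0 / sigma**2
S_N = np.linalg.inv(alpha * np.eye(3) + beta * X.T @ X)
theta_map = beta * S_N @ X.T @ y

# Ridge solution: (X^T X + lam * I)^-1 X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(theta_map, theta_ridge))  # True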