
Bayesian Linear Regression

Standard linear regression provides a single "best-fit" line (point estimates for parameters). Bayesian Linear Regression treats the parameters as random variables, allowing us to quantify our uncertainty about the model's parameters and its predictions.


1. The Bayesian Framework

Instead of finding a single $\theta$, we compute the Posterior Distribution over parameters using Bayes' Theorem:

$$p(\theta \mid y, X) = \frac{p(y \mid X, \theta)\, p(\theta)}{p(y \mid X)}$$

Where:

  • $p(\theta)$ is the Prior: Our belief about the parameters before seeing any data. Usually, we assume a zero-mean Gaussian prior: $p(\theta) = \mathcal{N}(\theta \mid 0, \alpha^{-1} I)$.
  • $p(y \mid X, \theta)$ is the Likelihood: How well the parameters explain the observed data.
  • $p(\theta \mid y, X)$ is the Posterior: Our updated belief after seeing the data.
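With a Gaussian likelihood and the zero-mean Gaussian prior above, the posterior has a closed form: $S_N = (\alpha I + \beta \Phi^\top \Phi)^{-1}$ and $m_N = \beta S_N \Phi^\top y$, where $\beta$ is the (assumed known) noise precision. A minimal NumPy sketch on a synthetic 1-D dataset — the data, $\alpha$, and $\beta$ values here are illustrative assumptions, not from the text:

```python
import numpy as np

# Synthetic data: y = 2x + Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=20)
y = 2.0 * X + rng.normal(0, 0.5, size=20)

# Design matrix with a bias column
Phi = np.column_stack([np.ones_like(X), X])

alpha = 1.0  # prior precision: prior is N(0, alpha^-1 I)
beta = 4.0   # noise precision 1/sigma^2, assumed known

# Posterior over theta is Gaussian N(m_N, S_N):
#   S_N^-1 = alpha * I + beta * Phi^T Phi
#   m_N    = beta * S_N Phi^T y
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

print(m_N)  # posterior mean; slope should land near the true value 2
```

The posterior mean $m_N$ is also the MAP estimate here, since the posterior is Gaussian.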

2. Predictive Distribution

The real power of Bayesian regression lies in the Predictive Distribution. For a new input $x^*$, we don't just get a single value $y^*$; we get a distribution:

$$p(y^* \mid x^*, y, X) = \int p(y^* \mid x^*, \theta)\, p(\theta \mid y, X)\, d\theta$$

This distribution is also Gaussian:

$$p(y^* \mid x^*, y, X) = \mathcal{N}(y^* \mid \mu^*, (\sigma^*)^2)$$
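For the conjugate Gaussian model, the integral is analytic: $\mu^* = m_N^\top \phi(x^*)$ and $(\sigma^*)^2 = 1/\beta + \phi(x^*)^\top S_N\, \phi(x^*)$, i.e. noise variance plus parameter uncertainty. A self-contained sketch (same synthetic setup and assumed $\alpha$, $\beta$ as above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=20)
y = 2.0 * X + rng.normal(0, 0.5, size=20)
Phi = np.column_stack([np.ones_like(X), X])

alpha, beta = 1.0, 4.0
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ y

def predict(x_star):
    """Predictive mean and variance at a new input x_star."""
    phi = np.array([1.0, x_star])
    mu = m_N @ phi
    var = 1.0 / beta + phi @ S_N @ phi  # noise + parameter uncertainty
    return mu, var

mu, var = predict(3.0)
print(mu, var)  # mean near 2 * 3 = 6; variance strictly above 1/beta
```

Note that $(\sigma^*)^2 \ge 1/\beta$ always: even with infinite data, the irreducible noise variance remains.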

3. Visualizing Uncertainty

A major advantage of the Bayesian approach is that the model "knows what it doesn't know." The predictive variance $(\sigma^*)^2$ is small in regions where we have lots of training data and grows large in regions where data is sparse.

[Figure: Bayesian Predictive Uncertainty]

Unlike OLS, which gives a single point estimate, Bayesian regression provides a full predictive distribution. Notice how uncertainty (shaded area) increases as we move away from the training data ($x = 1$ to $x = 5$).
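This behavior is easy to check numerically: compare the predictive standard deviation inside the training range with a point far outside it. A sketch using a synthetic dataset on the same $[1, 5]$ range (the data and precision values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=20)  # training inputs live in [1, 5]
y = 2.0 * X + rng.normal(0, 0.5, size=20)
Phi = np.column_stack([np.ones_like(X), X])

alpha, beta = 1.0, 4.0
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)

def predictive_std(x_star):
    phi = np.array([1.0, x_star])
    return np.sqrt(1.0 / beta + phi @ S_N @ phi)

# Inside the data region vs. far outside it
std_near, std_far = predictive_std(3.0), predictive_std(10.0)
print(std_near, std_far)  # the model is less certain far from the data
```

The $\phi(x^*)^\top S_N\, \phi(x^*)$ term grows as $x^*$ moves away from the bulk of the training inputs, which is exactly the widening shaded band in the figure.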


Comparison: OLS vs. Bayesian

| Feature | Ordinary Least Squares (OLS) | Bayesian Linear Regression |
| --- | --- | --- |
| Output | Single line (point estimates) | Distribution of lines (posterior) |
| Prior Knowledge | Not explicitly used | Encoded via the prior $p(\theta)$ |
| Uncertainty | Not directly provided | Quantified via the predictive variance |
| Overfitting | Prone if no regularization | Naturally regularized by the prior |

Regularization Connection: It can be shown that finding the MAP (Maximum A Posteriori) estimate with a Gaussian prior is mathematically equivalent to Ridge Regression. The regularization parameter $\lambda$ is effectively the ratio of the noise variance to the prior variance.
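This equivalence can be verified directly: with noise variance $\sigma^2$ and prior precision $\alpha$, the MAP estimate equals the ridge solution with $\lambda = \sigma^2 \alpha$ (noise variance divided by prior variance $\alpha^{-1}$). A sketch on synthetic data (all values here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))
theta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ theta_true + rng.normal(0, sigma, size=30)

alpha = 2.0             # prior precision
lam = sigma**2 * alpha  # equivalent ridge parameter: noise var / prior var

# MAP estimate (= posterior mean for Gaussian likelihood + Gaussian prior)
beta = 1.0 / sigma**2
S_N = np.linalg.inv(alpha * np.eye(3) + beta * X.T @ X)
theta_map = beta * S_N @ X.T @ y

# Ridge solution: (X^T X + lam * I)^-1 X^T y
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

print(np.allclose(theta_map, theta_ridge))  # True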