Machine Learning
Linear Classification
Probabilistic Generative Models

Generative models take a different approach to classification: they try to model how the data was generated for each class. Instead of just learning a boundary, they learn the distribution of the features for each class.


1. The MLE Landscape

In Maximum Likelihood Estimation (MLE), we can categorize algorithms based on how they model the likelihood L(\theta; D) = P(D; \theta):

  • Discriminative P(y \mid x): Directly models the mapping from inputs to outputs.
    • Linear: Gaussian noise → Linear Regression.
    • Logistic: Bernoulli noise → Logistic Regression.
  • Generative P(x, y): Models the joint probability. Both x and y are treated as random variables.
    • Example: Naive Bayes.
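
As a sketch of the generative branch, a minimal Bernoulli Naive Bayes fit by MLE on a hypothetical binary dataset (all data values here are illustrative) could look like:

```python
import numpy as np

# Hypothetical toy dataset: 6 samples, 2 binary features.
X = np.array([[1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1]])
y = np.array([1, 1, 1, 0, 0, 0])

def fit_naive_bayes(X, y):
    """MLE for Bernoulli Naive Bayes: class priors and per-feature P(x_j = 1 | y)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])          # P(y = c)
    cond = np.array([X[y == c].mean(axis=0) for c in classes])     # P(x_j = 1 | y = c)
    return classes, priors, cond

def predict(x, classes, priors, cond):
    # Joint P(x, y) = P(y) * prod_j P(x_j | y), under the naive independence assumption.
    likelihood = np.prod(cond**x * (1 - cond)**(1 - x), axis=1)
    joint = priors * likelihood
    return classes[np.argmax(joint)]

classes, priors, cond = fit_naive_bayes(X, y)
print(predict(np.array([1, 0]), classes, priors, cond))  # -> 1
```

Because the model is the joint P(x, y), classification reduces to picking the class with the largest joint probability; normalizing by P(x) would not change the argmax.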

2. The Generative Approach

A generative model learns:

  1. Class-Conditional Densities: P(x \mid y) (How does the data for class y look?)
  2. Class Priors: P(y) (How common is class y?)

To classify a new point x, we use Bayes' Theorem to find the posterior probability P(y \mid x):

P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}
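
With hypothetical numbers plugged in (the density and prior values below are made up for illustration), the posterior computation is just this arithmetic:

```python
# Hypothetical two-class example: class-conditional densities evaluated at a point x.
p_x_given_y = {0: 0.05, 1: 0.20}   # P(x | y)
p_y = {0: 0.7, 1: 0.3}             # class priors P(y)

# Evidence P(x) = sum_y P(x | y) P(y)  (law of total probability)
p_x = sum(p_x_given_y[c] * p_y[c] for c in p_y)

# Posterior P(y | x) via Bayes' theorem
posterior = {c: p_x_given_y[c] * p_y[c] / p_x for c in p_y}
print(posterior)  # P(y=1 | x) ≈ 0.632: the likelihood outweighs the smaller prior
```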

Dependent Events and Decomposition

For dependent events x and y, the joint probability is: P(x, y) = P(x \cap y) = P(x \mid y)P(y) = P(y \mid x)P(x)

Generative ML assumes P(D; \theta) = P(x, y; \theta), while Discriminative ML assumes P(D; \theta) = P(y \mid x; \theta).


3. Gaussian Class Densities

In models like Gaussian Discriminant Analysis (GDA), we assume classes follow a Gaussian distribution.

Example: Drawing Samples. If a class has distribution x \sim \mathcal{N}(\mu = 2, \sigma = 1.5), roughly 68% of samples fall within one standard deviation of the mean, i.e. in [2 - 1.5, 2 + 1.5] = [0.5, 3.5] (the high-density region), and about 95% within two.
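
A quick empirical check of this (assuming the 1.5 denotes the standard deviation, not the variance):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 1.5                      # hypothetical class-conditional Gaussian
samples = rng.normal(mu, sigma, size=100_000)

# Fraction of draws within one standard deviation of the mean (~0.68 for a Gaussian)
within_1sigma = np.mean(np.abs(samples - mu) < sigma)
print(round(within_1sigma, 2))  # ~0.68
```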

Linear (LDA)

Assumes both classes share the same covariance. The resulting decision boundary is a straight line (a hyperplane in higher dimensions).

Quadratic (QDA)

Allows each class to have its own covariance. The resulting decision boundary is a curve (a quadric surface in higher dimensions).

  • If all classes share the same covariance matrix \Sigma, the decision boundary is linear (LDA).
  • If classes have different covariance matrices \Sigma_k, the decision boundary is quadratic (QDA).
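
A minimal 1-D sketch of why this happens (means and variances below are illustrative, priors assumed equal): with a shared variance the quadratic terms cancel in the log-ratio of the class densities; with per-class variances they do not.

```python
import numpy as np

# Log-density of a 1-D Gaussian class-conditional.
def log_density(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)

x = np.linspace(-5.0, 10.0, 7)

# LDA: shared variance. The x^2 terms cancel in the log-ratio, leaving a
# linear function of x, so the boundary is a single point (a hyperplane in d-D).
lda_ratio = log_density(x, mu=0.0, var=2.0) - log_density(x, mu=4.0, var=2.0)

# QDA: per-class variances. The x^2 terms no longer cancel, so the log-ratio
# is quadratic in x and the boundary can be two points (a curve in d-D).
qda_ratio = log_density(x, mu=0.0, var=1.0) - log_density(x, mu=4.0, var=9.0)

# Second differences vanish for a linear function but not for a quadratic one.
print(np.allclose(np.diff(lda_ratio, n=2), 0))  # True
print(np.allclose(np.diff(qda_ratio, n=2), 0))  # False
```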

Class-Conditional Densities

Generative models learn the distribution of each class independently, p(x \mid C_k). Bayes' rule is then used to compute the posterior p(C_k \mid x) for classification.
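
Putting the pieces together, a toy 1-D generative classifier fit on hypothetical synthetic data (all parameters below are assumptions for illustration) might look like:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 1-D training data drawn from two Gaussian classes.
x0 = rng.normal(0.0, 1.0, size=200)   # class 0
x1 = rng.normal(4.0, 1.0, size=300)   # class 1

# Learn each class-conditional density independently (MLE: sample mean/variance),
# plus the class priors from the label counts.
mu = [x0.mean(), x1.mean()]
var = [x0.var(), x1.var()]
prior = [len(x0) / 500, len(x1) / 500]

def posterior(x):
    """p(C_k | x) via Bayes' rule with Gaussian class-conditionals."""
    dens = [np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
            for k in range(2)]
    joint = [dens[k] * prior[k] for k in range(2)]
    evidence = sum(joint)                 # P(x), normalizes the posterior
    return [j / evidence for j in joint]

print(np.argmax(posterior(3.5)))  # point near class 1's mean -> 1
```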


Comparison: Generative vs. Discriminative

| Feature              | Generative Models (e.g., GDA, Naive Bayes)  | Discriminative Models (e.g., Logistic Regression) |
| -------------------- | ------------------------------------------- | ------------------------------------------------- |
| Assumption           | P(D; \theta) = P(x, y; \theta)              | P(D; \theta) = P(y \mid x; \theta)                |
| Random Variables     | Both x and y are random variables           | y is a random variable, x is not                  |
| Alternative Notation | -                                           | P(y; x, \theta)                                   |
| Goal                 | Model P(x, y)                               | Model P(y \mid x) directly                        |
| New Samples          | Can generate new data points from P(x \mid y) | Cannot generate new data                        |

Bayes' Rule for Naive Bayes: generative models learn P(x \mid y) and P(y), then use P(y \mid x) = \frac{P(x \mid y)P(y)}{P(x)} to solve the classification problem.