
Naive Bayes

Naive Bayes is a probabilistic generative model for multi-class classification. It treats both the inputs $x$ and the outputs $y$ as random variables.


1. Bayes' Rule and Generative ML

For generative models, we use Bayes' Rule to find the posterior:

$$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$$

In classification, we want to find the class $\hat{y}$ that maximizes the posterior. Since the evidence $P(x)$ is the same for every class, it can be dropped from the maximization:

$$\hat{y} = \arg\max_y P(y) \prod_{j=1}^n P(x_j \mid y)$$
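The decision rule above can be sketched in a few lines. The priors and per-feature likelihoods below are made-up illustrative numbers, not estimates from real data:

```python
# Pick the class with the largest unnormalized posterior
# P(y) * prod_j P(x_j | y). The evidence P(x) is identical for
# every class, so it is omitted from the argmax.
from math import prod

priors = {0: 0.6, 1: 0.4}       # P(y), illustrative values
likelihoods = {                 # P(x_j | y) for each feature j
    0: [0.5, 0.2, 0.1],
    1: [0.3, 0.4, 0.3],
}

def predict(priors, likelihoods):
    scores = {y: priors[y] * prod(likelihoods[y]) for y in priors}
    return max(scores, key=scores.get)

print(predict(priors, likelihoods))  # -> 1 (0.4*0.3*0.4*0.3 > 0.6*0.5*0.2*0.1)
```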


2. The Naive Assumption: Feature Independence

The "naive" part comes from the assumption that, given the class $y$, all features $x_1, x_2, \dots, x_n$ are independent:

$$P(x_1, x_2, \dots, x_n \mid y) = P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_n \mid y) = \prod_{j=1}^n P(x_j \mid y)$$

Example: Likelihood from Data

Consider a dataset with features $x_1, x_2, x_3$ and target $y$:

| $x_1$ | $x_2$ | $x_3$ | $y$ |
|-------|-------|-------|-----|
| 1     | 3     | 5     | 0   |
| 2     | 6     | 1     | 1   |
| 3     | 1     | 9     | 0   |
| 4     | 10    | 1     | 1   |

To classify a new point, we calculate $P(y=0) \cdot P(x_1 \mid y=0) \cdots$ and $P(y=1) \cdot P(x_1 \mid y=1) \cdots$ and pick the maximum.
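The counting step on the toy table above can be sketched as follows: estimate $P(y)$ and $P(x_j \mid y)$ by relative frequency. With only four rows most counts are 0 or 1, which is exactly what motivates the smoothing in Section 3:

```python
# Estimate priors and per-feature likelihoods by counting over the
# toy dataset from the table above.
from collections import Counter

data = [  # rows of (x1, x2, x3, y)
    (1, 3, 5, 0),
    (2, 6, 1, 1),
    (3, 1, 9, 0),
    (4, 10, 1, 1),
]

class_counts = Counter(row[-1] for row in data)

def prior(y):
    # P(y) = count(y) / N
    return class_counts[y] / len(data)

def likelihood(j, value, y):
    # P(x_j = value | y) = count(x_j = value, y) / count(y)
    matches = sum(1 for row in data if row[j] == value and row[-1] == y)
    return matches / class_counts[y]

print(prior(0))              # -> 0.5 (two of four rows have y = 0)
print(likelihood(2, 1, 1))   # x3 = 1 occurs in both class-1 rows -> 1.0
print(likelihood(0, 2, 0))   # x1 = 2 never occurs with y = 0 -> 0.0
```

The last line is the zero-count problem: one unseen feature/class pair zeroes out the whole product.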

Class-Conditional Densities

P(x | Class): The distribution of the features within each class (e.g., spam vs. ham in email filtering). Naive Bayes estimates these densities independently for each feature.

Posterior Probabilities

P(Class | x): The final probability used for classification, derived via Bayes' Theorem.


3. Laplace Smoothing

If a feature value never appears with a class in the training set (e.g., $\mathrm{count}(x_1 = 5, y = 0) = 0$), the entire product becomes zero. We fix this by smoothing:

$$\hat{P}(x_j \mid y) = \frac{\mathrm{count}(x_j, y) + \alpha}{\mathrm{count}(y) + \alpha \cdot |V|}$$

where $|V|$ is the number of possible values for feature $x_j$, and $\alpha$ is the smoothing parameter (usually $\alpha = 1$).
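A minimal sketch of the smoothed estimate, assuming the feature takes one of 10 possible values ($|V| = 10$); the counts are illustrative:

```python
# Laplace-smoothed likelihood estimate:
# (count(x_j, y) + alpha) / (count(y) + alpha * |V|)
def smoothed_likelihood(count_xy, count_y, n_values, alpha=1.0):
    return (count_xy + alpha) / (count_y + alpha * n_values)

# Unseen combination: count(x1 = 5, y = 0) = 0 with count(y = 0) = 2.
# Instead of 0, the estimate is 1/12, so the product no longer collapses.
print(smoothed_likelihood(0, 2, 10))   # -> 0.0833... (= 1/12)
```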


4. Summary of Generative Approach

| Generative Feature | Description |
|--------------------|-------------|
| Probability        | Models the joint probability $P(x, y)$ |
| Independence       | Assumes features are independent given class $y$ |
| Rule               | Uses $\hat{y} = \arg\max_y P(y) \prod_j P(x_j \mid y)$ |
| Smoothing          | Necessary to handle unseen feature values |

Goal: We want to find the class $y$ that is most likely to have generated the observed features $x$.
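Putting the pieces together, here is an end-to-end sketch on the toy table: priors, Laplace-smoothed likelihoods, and the argmax rule computed in log space (summing logs avoids numerical underflow when many small probabilities are multiplied). The per-feature value counts `n_values` are assumptions about the feature ranges:

```python
# End-to-end Naive Bayes on the toy dataset from Section 2, with
# Laplace smoothing (alpha = 1) and log-space scoring.
from collections import Counter
from math import log

data = [(1, 3, 5, 0), (2, 6, 1, 1), (3, 1, 9, 0), (4, 10, 1, 1)]
n_values = [4, 10, 9]   # assumed |V| per feature: x1 in 1..4, x2 in 1..10, x3 in 1..9
class_counts = Counter(row[-1] for row in data)

def log_posterior(x, y, alpha=1.0):
    # log P(y) + sum_j log P_hat(x_j | y), with the evidence P(x) dropped
    score = log(class_counts[y] / len(data))
    for j, value in enumerate(x):
        count_xy = sum(1 for r in data if r[j] == value and r[-1] == y)
        score += log((count_xy + alpha) / (class_counts[y] + alpha * n_values[j]))
    return score

def predict(x):
    return max(class_counts, key=lambda y: log_posterior(x, y))

print(predict((2, 6, 1)))   # matches the second training row -> 1
```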