Machine Learning
Linear Classification
Naive Bayes

Naive Bayes

Naive Bayes is a probabilistic Generative Model for multi-class classification. It treats both inputs xx and outputs yy as random variables.

⚖️

Foundations: Probability Rules Understanding Naive Bayes requires mastering three fundamental rules:

  1. Product Rule: P(X,Y)=P(YX)P(X)P(X, Y) = P(Y \mid X) P(X).
  2. Sum Rule: P(X)=YP(X,Y)P(X) = \sum_Y P(X, Y).
  3. Bayes' Rule: P(YX)=P(XY)P(Y)P(X)P(Y \mid X) = \frac{P(X \mid Y) P(Y)}{P(X)}.

1. Bayes' Rule and Generative ML

For generative models, we use Bayes' Rule to find the posterior:

P(yx)=P(xy)P(y)P(x)P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}

In classification, we want to find the class y^\hat{y} that maximizes the posterior:

y^=argmaxyP(y)j=1nP(xjy)\hat{y} = \arg\max_y P(y) \prod_{j=1}^n P(x_j \mid y)


2. Naive Bayes for Text Classification

Naive Bayes is particularly powerful for Natural Language Processing (NLP). We represent each document as a "Bag of Words" mapped against a fixed Dictionary of vocabulary.

The Dictionary Model

If our dictionary contains nn words, each document becomes an nn-dimensional vector x\mathbf{x}, where each dimension represents the frequency of a word:

Documentword1word_1word2word_2\dotswordnword_nClass (yy)
d1d_120\dots1Spam
d2d_201\dots0Not Spam

The Frequency Logic

For a document dd, we calculate the probability of it belonging to class cc: c^=argmaxcCP(c)i=1len(d)P(wic)\hat{c} = \arg\max_{c \in C} P(c) \prod_{i=1}^{\text{len}(d)} P(w_i \mid c)

Where P(wic)P(w_i \mid c) is the probability of word wiw_i appearing in a document given the class is cc. This allows us to handle high-dimensional text data efficiently.


3. The Naive Assumption: Feature Independence

The "Naive" part comes from the assumption that given the class yy, all features x1,x2,,xnx_1, x_2, \dots, x_n are independent:

P(x1,x2,,xny)=P(x1y)P(x2y)P(xny)=j=1nP(xjy)P(x_1, x_2, \dots, x_n \mid y) = P(x_1 \mid y) P(x_2 \mid y) \dots P(x_n \mid y) = \prod_{j=1}^n P(x_j \mid y)

Example: Likelihood from Data

Consider a dataset with features x1,x2,x3x_1, x_2, x_3 and target yy:

x1x_1x2x_2x3x_3yy
1350
2611
3190
41011

To classify a new point, we calculate P(y=0)P(x1y=0)P(y=0) \cdot P(x_1 \mid y=0) \dots and P(y=1)P(x1y=1)P(y=1) \cdot P(x_1 \mid y=1) \dots and pick the maximum.

Class-Conditional Densities

P(x | Class): The distribution of features for Spam vs Ham. Naive Bayes estimates these independently.

Posterior Probabilities

P(Class | x): The final probability used for classification, derived via Bayes' Theorem.


3. Laplace Smoothing

If a feature value never appears with a class in the training set (e.g., count(x1=5,y=0)=0count(x_1=5, y=0) = 0), the entire product becomes zero. We fix this by smoothing:

P^(xjy)=count(xj,y)+αcount(y)+αV\hat{P}(x_j \mid y) = \frac{count(x_j, y) + \alpha}{count(y) + \alpha \cdot |V|}

Where V|V| is the number of possible values for feature xjx_j, and α\alpha is the smoothing parameter (usually α=1\alpha=1).


4. Summary of Generative Approach

Generative FeatureDescription
ProbabilityModels the joint probability P(x,y)P(x, y)
IndependenceAssumes features are independent given class yy
RuleUses y^=argmaxP(y)P(xjy)\hat{y} = \arg\max P(y) \prod P(x_j \mid y)
SmoothingNecessary to handle unseen feature values

Goal: We want to find the class yy that is most likely to have generated the observed features xx.