Naive Bayes

Naive Bayes is a probabilistic generative model for multi-class classification. It treats both the inputs $x$ and the outputs $y$ as random variables and models their joint distribution $P(x, y)$.


Foundations: Probability Rules

Understanding Naive Bayes requires three fundamental rules:

Product Rule: $P(X, Y) = P(Y \mid X) P(X)$.
Sum Rule: $P(X) = \sum_Y P(X, Y)$.
Bayes' Rule: $P(Y \mid X) = \frac{P(X \mid Y) P(Y)}{P(X)}$.
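
The three rules can be checked numerically on a small joint distribution. The sketch below uses a made-up 2x2 joint table (the numbers and function names are illustrative assumptions, not taken from any dataset) and confirms that Bayes' Rule gives the same conditional as dividing the joint by the marginal.

```python
# Hypothetical joint distribution P(X, Y) over X in {0, 1} and Y in {0, 1}.
# The values are illustrative only; they just need to sum to 1.
joint = {
    (0, 0): 0.3, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.4,
}

def marginal_x(x):
    # Sum rule: P(X) = sum over Y of P(X, Y)
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    # Sum rule applied to the other variable: P(Y) = sum over X of P(X, Y)
    return sum(p for (_, yi), p in joint.items() if yi == y)

def cond_y_given_x(y, x):
    # Product rule rearranged: P(Y | X) = P(X, Y) / P(X)
    return joint[(x, y)] / marginal_x(x)

def cond_x_given_y(x, y):
    # Same rearrangement with the roles swapped: P(X | Y) = P(X, Y) / P(Y)
    return joint[(x, y)] / marginal_y(y)

def bayes_y_given_x(y, x):
    # Bayes' rule: P(Y | X) = P(X | Y) P(Y) / P(X)
    return cond_x_given_y(x, y) * marginal_y(y) / marginal_x(x)
```

Both routes to $P(Y \mid X)$ agree up to floating-point rounding, which is exactly what Bayes' Rule states.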

1. Bayes' Rule and Generative ML

For generative models, we use Bayes' Rule to find the posterior:

$P(y \mid x) = \frac{P(x \mid y) P(y)}{P(x)}$

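In classification, we want the class $\hat{y}$ that maximizes the posterior. The evidence $P(x)$ does not depend on $y$, so it can be dropped from the maximization. Factoring $P(x \mid y)$ over the $n$ features with the naive independence assumption (Section 3) then gives:

$\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y P(x \mid y) P(y) = \arg\max_y P(y) \prod_{j=1}^n P(x_j \mid y)$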

2. Naive Bayes for Text Classification

Naive Bayes is widely used for text classification in Natural Language Processing (NLP). We represent each document as a "Bag of Words": a vector of word counts over a fixed dictionary (vocabulary), with word order ignored.

The Dictionary Model

If our dictionary contains $n$ words, each document becomes an $n$-dimensional vector $\mathbf{x}$, where each dimension holds the number of times the corresponding dictionary word occurs in the document:

Document	$word_1$	$word_2$	$\dots$	$word_n$	Class ($y$)
$d_1$	2	0	$\dots$	1	Spam
$d_2$	0	1	$\dots$	0	Not Spam
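
As a rough illustration of the mapping from text to count vectors, the sketch below builds bag-of-words vectors over a tiny dictionary. The dictionary, the example sentences, and the `bag_of_words` helper are all hypothetical, and splitting on whitespace is a simplifying assumption; real tokenization is usually more involved.

```python
from collections import Counter

def bag_of_words(text, dictionary):
    """Map a document to a vector of word counts over a fixed dictionary."""
    counts = Counter(text.lower().split())
    # Words that are not in the dictionary are simply ignored.
    return [counts[word] for word in dictionary]

# Hypothetical dictionary and documents, for illustration only.
dictionary = ["free", "meeting", "money"]
x1 = bag_of_words("Free money free prizes", dictionary)  # [2, 0, 1]
x2 = bag_of_words("Meeting at noon", dictionary)         # [0, 1, 0]
```

Each vector has one entry per dictionary word, so its length depends only on the dictionary size and not on the document length.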

The Frequency Logic

For a document $d$ with words $w_1, \dots, w_{\text{len}(d)}$, we score every class $c$ and pick the class with the highest score:

$\hat{c} = \arg\max_{c \in C} P(c) \prod_{i=1}^{\text{len}(d)} P(w_i \mid c)$

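Where $P(w_i \mid c)$ is the probability of word $w_i$ appearing in a document of class $c$. Each class needs only one probability per dictionary word, so the number of parameters grows linearly with the dictionary size, which keeps high-dimensional text data tractable.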
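
A product of many small probabilities can underflow in floating point, so implementations commonly sum log-probabilities instead; the logarithm is monotonic, so the argmax is unchanged. The sketch below shows this under stated assumptions: the priors and word probabilities are hand-picked, hypothetical values, and every word in the document is assumed to be in the dictionary.

```python
import math

# Hypothetical, hand-picked parameters for illustration only.
priors = {"spam": 0.4, "ham": 0.6}
word_probs = {
    "spam": {"free": 0.5, "money": 0.4, "meeting": 0.1},
    "ham": {"free": 0.1, "money": 0.2, "meeting": 0.7},
}

def log_score(words, c):
    # log P(c) + sum_i log P(w_i | c); same argmax as the product form.
    return math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in words)

def classify(words):
    # Pick the class with the highest log-score.
    return max(priors, key=lambda c: log_score(words, c))

prediction = classify(["free", "money", "free"])
```

With these made-up parameters the document "free money free" scores higher under the spam class, because its words are far more probable there than under ham.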

3. The Naive Assumption: Feature Independence

The "Naive" part comes from the assumption that, given the class $y$, all features $x_1, x_2, \dots, x_n$ are conditionally independent of one another:

$P(x_1, x_2, \dots, x_n \mid y) = P(x_1 \mid y) P(x_2 \mid y) \dots P(x_n \mid y) = \prod_{j=1}^n P(x_j \mid y)$

Example: Likelihood from Data

Consider a dataset with features $x_1, x_2, x_3$ and target $y$:

$x_1$	$x_2$	$x_3$	$y$
1	3	5	0
2	6	1	1
3	1	9	0
4	10	1	1

To classify a new point $x = (x_1, x_2, x_3)$, we compute $P(y=0) \prod_{j=1}^3 P(x_j \mid y=0)$ and $P(y=1) \prod_{j=1}^3 P(x_j \mid y=1)$, then pick the class with the larger value.
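
The calculation can be sketched directly on the four training rows from the table above. This is a minimal illustration, assuming the features are treated as categorical values and the probabilities are estimated by simple counting; the query point `(1, 3, 5)` and the function names are hypothetical.

```python
# Training rows (x1, x2, x3, y) copied from the table above.
rows = [(1, 3, 5, 0), (2, 6, 1, 1), (3, 1, 9, 0), (4, 10, 1, 1)]

def prior(y):
    # P(y): fraction of training rows with class y.
    return sum(1 for r in rows if r[3] == y) / len(rows)

def likelihood(j, value, y):
    # Unsmoothed estimate of P(x_j = value | y): count(x_j = value, y) / count(y).
    in_class = [r for r in rows if r[3] == y]
    return sum(1 for r in in_class if r[j] == value) / len(in_class)

def score(x, y):
    # P(y) times the product of the per-feature likelihoods.
    p = prior(y)
    for j, value in enumerate(x):
        p *= likelihood(j, value, y)
    return p

new_point = (1, 3, 5)  # hypothetical query point
scores = {y: score(new_point, y) for y in (0, 1)}
prediction = max(scores, key=scores.get)
```

For this query, class 0 scores $0.5 \cdot 0.5 \cdot 0.5 \cdot 0.5 = 0.0625$, while class 1 scores exactly zero because none of the query's feature values ever occurs with $y=1$ in the training rows.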

Class-Conditional Densities

$P(x \mid y)$: the distribution of the features within each class (e.g., Spam vs. Ham). Naive Bayes estimates each feature's distribution $P(x_j \mid y)$ independently of the other features.

Posterior Probabilities

$P(y \mid x)$: the final probability used for classification, derived via Bayes' Rule.
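
The argmax only needs the unnormalized scores $P(x \mid y) P(y)$, but an actual posterior requires dividing by the evidence $P(x)$, which the sum rule gives as the sum of those scores over all classes. A minimal sketch, with a hypothetical `posteriors` helper and made-up scores:

```python
def posteriors(joint_scores):
    """Turn scores P(x | y) P(y) into posteriors P(y | x) by normalizing."""
    evidence = sum(joint_scores.values())  # P(x), via the sum rule
    if evidence == 0:
        raise ValueError("all classes have zero score, so P(x) is zero")
    return {y: s / evidence for y, s in joint_scores.items()}

# Hypothetical joint scores for one message.
post = posteriors({"spam": 0.04, "ham": 0.01})
```

Here the spam posterior works out to $0.04 / 0.05 = 0.8$, and the posteriors over all classes sum to one.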

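4. Laplace Smoothing

If a feature value never appears with a class in the training set (e.g., $count(x_1=5, y=0) = 0$), its estimated likelihood is zero and the entire product becomes zero, regardless of the other features. We fix this by adding a pseudo-count $\alpha$ to every count:

$\hat{P}(x_j \mid y) = \frac{count(x_j, y) + \alpha}{count(y) + \alpha \cdot |V|}$

Where $count(x_j, y)$ is the number of training examples of class $y$ with that value of feature $x_j$, $count(y)$ is the number of training examples of class $y$, $|V|$ is the number of possible values for feature $x_j$, and $\alpha$ is the smoothing parameter (usually $\alpha=1$).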
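
The smoothed estimate can be sketched on the four-row dataset from the likelihood example. The number of possible values $|V|$ cannot be read off the data, so the caller passes it in as `num_values`; the choice of five possible values for $x_1$ below is purely an assumption for illustration, as is the function name.

```python
def smoothed_likelihood(rows, j, value, y, num_values, alpha=1.0):
    """Laplace-smoothed estimate of P(x_j = value | y).

    rows: tuples whose last entry is the class label.
    num_values: |V|, the number of possible values of feature j (supplied by the caller).
    """
    in_class = [r for r in rows if r[-1] == y]
    count_xy = sum(1 for r in in_class if r[j] == value)
    return (count_xy + alpha) / (len(in_class) + alpha * num_values)

# Training rows (x1, x2, x3, y) from the likelihood example.
rows = [(1, 3, 5, 0), (2, 6, 1, 1), (3, 1, 9, 0), (4, 10, 1, 1)]

# Assumption for illustration: x1 can take the five values 1..5.
p_unseen = smoothed_likelihood(rows, 0, 5, 0, num_values=5)  # (0 + 1) / (2 + 5)
p_seen = smoothed_likelihood(rows, 0, 1, 0, num_values=5)    # (1 + 1) / (2 + 5)
```

The unseen value now gets a small nonzero probability instead of zeroing out the product, and the smoothed estimates over all five assumed values still sum to one.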

5. Summary of Generative Approach

Generative Feature	Description
Probability	Models the joint probability $P(x, y)$
Independence	Assumes features are independent given class $y$
Rule	Uses $\hat{y} = \arg\max_y P(y) \prod_{j=1}^n P(x_j \mid y)$
Smoothing	Necessary to handle unseen feature values

Goal: We want to find the class $y$ with the highest posterior probability given the observed features $x$, using a model of how each class generates its features.
