Clustering, Mixture Models and EM
Unsupervised learning aims to find structure in unlabeled data. One of the most common tasks is Clustering, which involves grouping similar data points together.
1. K-Means Clustering
K-Means is the simplest and most widely used clustering algorithm. It partitions the data into $K$ clusters by minimizing the within-cluster sum of squares (Inertia):

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2$$

Where:
- $r_{nk} = 1$ if point $x_n$ is assigned to cluster $k$, 0 otherwise.
- $\mu_k$ is the center (centroid) of cluster $k$.
The Algorithm:
- Initialize: Choose $K$ initial centroids (e.g. random data points).
- Assign: Assign each point to the nearest centroid.
- Update: Move each centroid to the mean of its assigned points.
- Repeat: Until the centroids no longer move.
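The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name and details are my own, not a reference implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Each iteration can only decrease the objective $J$, so the algorithm always converges, though possibly to a local minimum that depends on the initialization.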
*Figure: K-Means Clustering — K-Means iteratively assigns each data point to the nearest centroid, then re-calculates each centroid as the mean of its assigned points.*
2. Gaussian Mixture Models (GMM)
While K-Means performs "hard" assignments (a point belongs to exactly one cluster), Mixture Models provide "soft" assignments (a point has a probability of belonging to each cluster).
A GMM assumes the data is generated from a weighted sum of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where the $\pi_k$ are the mixing coefficients ($\pi_k \ge 0$, and they must sum to 1).
*Figure: Gaussian Mixture Models (Soft Assignment) — unlike K-Means, a GMM gives each point a probability distribution over clusters; opacity represents the 'certainty' of cluster membership.*
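The mixture density can be evaluated directly from its definition. Here is a minimal 1-D NumPy sketch (the function name is hypothetical, chosen for illustration):

```python
import numpy as np

def gmm_density(x, weights, means, stds):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2), 1-D case for clarity."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for pi_k, mu_k, sigma_k in zip(weights, means, stds):
        # Weighted univariate Gaussian density for component k.
        total += pi_k * np.exp(-0.5 * ((x - mu_k) / sigma_k) ** 2) \
                 / (sigma_k * np.sqrt(2 * np.pi))
    return total
```

Because the mixing weights sum to 1 and each component is a normalized density, the mixture itself integrates to 1.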
3. The Expectation-Maximization (EM) Algorithm
EM is a powerful iterative framework for finding Maximum Likelihood estimates in latent-variable models like GMMs.
E-Step (Expectation):
Calculate the "responsibilities" $\gamma(z_{nk})$, the probability that point $x_n$ was generated by component $k$ given the current parameters:

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
M-Step (Maximization):
Update the parameters ($\pi_k, \mu_k, \Sigma_k$) to maximize the expected log-likelihood under the E-step responsibilities. With $N_k = \sum_n \gamma(z_{nk})$, the closed-form updates are:

$$\mu_k = \frac{1}{N_k} \sum_n \gamma(z_{nk}) \, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_n \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}$$
K-Means as a Limit: K-Means can be seen as a special, non-probabilistic case of the EM algorithm for Gaussian mixtures where we assume all covariances are $\epsilon I$ and take the limit $\epsilon \to 0$.
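This limit is easy to see numerically. With equal weights and shared covariance $\epsilon I$, the responsibility of component $k$ for a point $x$ is a softmax of $-\lVert x - \mu_k \rVert^2 / (2\epsilon)$, which hardens into a one-hot nearest-centroid assignment as $\epsilon \to 0$ (a small demonstration, not from the text):

```python
import numpy as np

def responsibilities(x, centers, eps):
    """Responsibilities for an equal-weight mixture with covariance eps*I."""
    sq = np.array([np.sum((x - c) ** 2) for c in centers])
    logits = -sq / (2 * eps)
    logits -= logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

x = np.array([1.0, 0.0])
centers = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
# Large eps: a genuinely soft assignment. Tiny eps: effectively the
# hard nearest-centroid assignment that K-Means makes.
```

So the E-step collapses into the K-Means assignment step, and the M-step (a responsibility-weighted mean) collapses into the K-Means centroid update.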