
Clustering, Mixture Models and EM

Unsupervised learning aims to find structure in unlabeled data. One of the most common tasks is Clustering, which involves grouping similar data points together.


1. K-Means Clustering

K-Means is the simplest and most widely used clustering algorithm. It partitions the data into $K$ clusters by minimizing the within-cluster sum of squares (inertia):

J = \sum_{n=1}^N \sum_{k=1}^K r_{nk} \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2

Where:

  • $r_{nk} = 1$ if point $n$ is assigned to cluster $k$, and 0 otherwise.
  • $\boldsymbol{\mu}_k$ is the center (centroid) of cluster $k$.

The Algorithm:

  1. Initialize: Choose $K$ random centroids.
  2. Assign: Assign each point to the nearest centroid.
  3. Update: Move each centroid to the mean of its assigned points.
  4. Repeat: Until the centroids no longer move.
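The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under assumed conventions (random data points as initial centroids, a fixed iteration cap), not a production implementation; the function name `kmeans` is chosen here:

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Plain K-Means: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick k distinct data points as starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # 2. Assign: label each point with the index of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: move each centroid to the mean of its assigned points
        #    (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Repeat until the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```

On two well-separated blobs this recovers the grouping after a handful of iterations; in practice one would also rerun with several random initializations, since K-Means only finds a local minimum of the objective.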

[Figure: K-Means clustering. K-Means iteratively assigns each data point to the nearest centroid and then re-computes each centroid as the mean of its assigned points.]


2. Gaussian Mixture Models (GMM)

While K-Means performs "hard" assignments (a point belongs to exactly one cluster), Mixture Models provide "soft" assignments (a point has a probability of belonging to each cluster).

A GMM assumes the data is generated from a weighted sum of $K$ Gaussians:

p(\mathbf{x}) = \sum_{k=1}^K \pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)

where the $\pi_k$ are the mixing coefficients, with $\pi_k \ge 0$ and $\sum_{k=1}^K \pi_k = 1$.
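The mixture density is just a weighted sum of Gaussian densities, which is a one-liner to evaluate. A minimal sketch, assuming SciPy is available (the function name `gmm_density` is chosen here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, weights, means, covs):
    """Evaluate p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(
        pi * multivariate_normal.pdf(x, mean=mu, cov=cov)
        for pi, mu, cov in zip(weights, means, covs)
    )
```

For example, a 1-D mixture with weights (0.3, 0.7), means 0 and 5, and unit variances evaluates at x = 0 to roughly 0.3 times the standard normal density at 0, since the second component contributes almost nothing there.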

[Figure: Gaussian mixture models (soft assignment). Unlike K-Means, a GMM yields a probability distribution over clusters; opacity represents certainty of cluster membership.]


3. The Expectation-Maximization (EM) Algorithm

EM is a powerful iterative framework for finding Maximum Likelihood estimates in latent-variable models like GMMs.

E-Step (Expectation):

Calculate the "responsibilities" $\gamma(z_{nk})$: the posterior probability that point $n$ was generated by component $k$ under the current parameters,

\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^K \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}

M-Step (Maximization):

Update the parameters ($\pi_k$, $\boldsymbol{\mu}_k$, $\boldsymbol{\Sigma}_k$) to maximize the expected log-likelihood under the E-step responsibilities. With $N_k = \sum_{n=1}^N \gamma(z_{nk})$, the updates are:

\pi_k = \frac{N_k}{N}, \qquad \boldsymbol{\mu}_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) \, \mathbf{x}_n, \qquad \boldsymbol{\Sigma}_k = \frac{1}{N_k} \sum_{n=1}^N \gamma(z_{nk}) (\mathbf{x}_n - \boldsymbol{\mu}_k)(\mathbf{x}_n - \boldsymbol{\mu}_k)^\top
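One full EM iteration for a GMM can be sketched as follows. This is an illustrative sketch (function and variable names are chosen here, SciPy is assumed, and a small ridge is added to each covariance for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, weights, means, covs):
    """One EM iteration for a Gaussian mixture model."""
    n, k = len(X), len(weights)
    # E-step: responsibilities gamma[n, j] proportional to
    # pi_j * N(x_n | mu_j, Sigma_j), normalized over components.
    gamma = np.column_stack([
        weights[j] * multivariate_normal.pdf(X, mean=means[j], cov=covs[j])
        for j in range(k)
    ])
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities.
    Nk = gamma.sum(axis=0)                   # effective counts N_k
    weights = Nk / n                         # new mixing coefficients
    means = (gamma.T @ X) / Nk[:, None]      # responsibility-weighted means
    covs = []
    for j in range(k):
        d = X - means[j]
        covs.append((gamma[:, j, None] * d).T @ d / Nk[j]
                    + 1e-6 * np.eye(X.shape[1]))  # small ridge for stability
    return weights, means, covs, gamma
```

Iterating this step monotonically increases the data log-likelihood until convergence to a local optimum; like K-Means, EM is sensitive to initialization, so multiple restarts are common.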


K-Means as a Limit: K-Means can be seen as a special, non-probabilistic case of the EM algorithm for Gaussian mixtures where we assume all covariances are $\epsilon \mathbf{I}$ and take the limit $\epsilon \to 0$.
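This limit is easy to see numerically: for an equal-weight mixture with shared covariance $\epsilon \mathbf{I}$, the responsibilities are a softmax of $-\|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 / (2\epsilon)$, which hardens into K-Means' one-hot nearest-centroid assignment as $\epsilon$ shrinks. A small sketch (names chosen here):

```python
import numpy as np

def responsibilities(X, means, eps):
    """Responsibilities for an equal-weight GMM with covariance eps * I."""
    # Squared distances to each mean, then a softmax of -d^2 / (2*eps).
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / (2 * eps)
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the softmax
    g = np.exp(logits)
    return g / g.sum(axis=1, keepdims=True)
```

For a point at 0.4 between means at 0 and 1, a large eps gives a nearly 50/50 split, while a tiny eps assigns essentially all responsibility to the nearer mean, reproducing the hard K-Means assignment.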