Clustering, Mixture Models and EM
Unsupervised learning aims to find structure in unlabeled data. One of the most common tasks is Clustering, which involves grouping similar data points together.
1. K-Means Clustering
K-Means is the simplest and most widely used clustering algorithm. It partitions the data into $K$ clusters by minimizing the within-cluster sum of squares (Inertia):

$$J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \lVert x_n - \mu_k \rVert^2$$

Where:
- $r_{nk} = 1$ if point $x_n$ is assigned to cluster $k$, 0 otherwise.
- $\mu_k$ is the center (centroid) of cluster $k$.
The Algorithm:
- Initialize: Choose $K$ initial centroids (e.g. random data points).
- Assign: Assign each point to the nearest centroid.
- Update: Move each centroid to the mean of its assigned points.
- Repeat: Until the centroids no longer move.
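The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration (the function name and details are my own, not a reference implementation):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-Means sketch: X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Initialize: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Repeat: stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Each iteration can only decrease the objective $J$, so the algorithm always converges, though possibly to a local minimum that depends on the initialization.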
*Figure: K-Means Clustering — K-Means iteratively assigns each data point to the nearest centroid, then re-calculates each centroid as the mean of its assigned points.*
2. Gaussian Mixture Models (GMM)
While K-Means performs "hard" assignments (a point belongs to exactly one cluster), Mixture Models provide "soft" assignments (a point has a probability of belonging to each cluster).
A GMM assumes the data is generated from a weighted sum of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

where the $\pi_k$ are the mixing coefficients ($\pi_k \ge 0$, and they must sum to 1).
*Figure: Gaussian Mixture Models (Soft Assignment) — unlike K-Means, a GMM gives each point a probability distribution over clusters; opacity represents the 'certainty' of cluster membership.*
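The mixture density can be evaluated directly from its definition. Here is a minimal 1-D NumPy sketch (the function name is hypothetical, chosen for illustration):

```python
import numpy as np

def gmm_density(x, weights, means, stds):
    """p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2), 1-D case for clarity."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for pi_k, mu_k, sigma_k in zip(weights, means, stds):
        # Weighted univariate Gaussian density for component k.
        total += pi_k * np.exp(-0.5 * ((x - mu_k) / sigma_k) ** 2) \
                 / (sigma_k * np.sqrt(2 * np.pi))
    return total
```

Because the mixing weights sum to 1 and each component is a normalized density, the mixture itself integrates to 1.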
3. The Expectation-Maximization (EM) Algorithm
EM is a powerful iterative framework for finding Maximum Likelihood estimates in latent-variable models like GMMs.
E-Step (Expectation):
Calculate the "responsibilities" $\gamma(z_{nk})$, the probability that point $x_n$ was generated by component $k$ given the current parameters:

$$\gamma(z_{nk}) = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
M-Step (Maximization):
Update the parameters ($\pi_k, \mu_k, \Sigma_k$) to maximize the expected log-likelihood under the E-step responsibilities. With $N_k = \sum_n \gamma(z_{nk})$, the closed-form updates are:

$$\mu_k = \frac{1}{N_k} \sum_n \gamma(z_{nk}) \, x_n, \qquad \Sigma_k = \frac{1}{N_k} \sum_n \gamma(z_{nk}) (x_n - \mu_k)(x_n - \mu_k)^\top, \qquad \pi_k = \frac{N_k}{N}$$
K-Means as a Limit: K-Means can be seen as a special, non-probabilistic case of the EM algorithm for Gaussian mixtures where we assume all covariances are $\epsilon I$ and take the limit $\epsilon \to 0$.
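This limit is easy to see numerically. With equal weights and shared covariance $\epsilon I$, the responsibility of component $k$ for a point $x$ is a softmax of $-\lVert x - \mu_k \rVert^2 / (2\epsilon)$, which hardens into a one-hot nearest-centroid assignment as $\epsilon \to 0$ (a small demonstration, not from the text):

```python
import numpy as np

def responsibilities(x, centers, eps):
    """Responsibilities for an equal-weight mixture with covariance eps*I."""
    sq = np.array([np.sum((x - c) ** 2) for c in centers])
    logits = -sq / (2 * eps)
    logits -= logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

x = np.array([1.0, 0.0])
centers = [np.array([0.0, 0.0]), np.array([3.0, 0.0])]
# Large eps: a genuinely soft assignment. Tiny eps: effectively the
# hard nearest-centroid assignment that K-Means makes.
```

So the E-step collapses into the K-Means assignment step, and the M-step (a responsibility-weighted mean) collapses into the K-Means centroid update.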