Kernel Methods and Sparse Kernel Machines
Kernel methods are a powerful class of algorithms that allow linear models to solve non-linear problems by projecting data into a higher-dimensional feature space.
1. The Kernel Trick
The key insight of kernel methods is that many linear algorithms depend on the data only through inner products between pairs of points, $\mathbf{x}^\top \mathbf{x}'$.
Instead of explicitly mapping data to a high-dimensional feature space via $\phi(\mathbf{x})$ (which could be infinite-dimensional!), we can use a Kernel Function that computes the inner product in that space directly: $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^\top \phi(\mathbf{x}')$.
Common Kernels:
- Linear: $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^\top \mathbf{x}'$
- Polynomial: $k(\mathbf{x}, \mathbf{x}') = (\mathbf{x}^\top \mathbf{x}' + c)^d$
- RBF (Gaussian): $k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\|\mathbf{x} - \mathbf{x}'\|^2 / 2\sigma^2\right)$
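The kernel trick can be verified numerically. The sketch below (using NumPy; the function names are my own) compares an explicit degree-2 feature map for 2D inputs against the polynomial kernel $(\mathbf{x}^\top \mathbf{x}')^2$: both yield the same number, but the kernel never constructs the feature vectors.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2D input:
    (x1^2, x2^2, sqrt(2)*x1*x2). Its inner product reproduces
    the polynomial kernel (x . x')^2."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, xp):
    """Kernel trick: the same inner product, computed directly
    in the original input space."""
    return np.dot(x, xp) ** 2

x = np.array([1.0, 2.0])
xp = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(xp))  # map to feature space, then take inner product
trick = poly_kernel(x, xp)          # never leaves the 2D input space

print(explicit, trick)  # equal up to floating-point error
```

For a degree-$d$ polynomial kernel in $D$ dimensions the explicit feature space grows combinatorially, while the kernel evaluation stays $O(D)$; that asymmetry is the entire point of the trick.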
[Figure, two panels. Left, "Original Space (2D)": the data is not linearly separable; no straight line separates the inner circle from the outer ring. Right, "Feature Space (with z = x² + y²)": after mapping the data to the extra dimension z, the classes become linearly separable by a horizontal plane.]
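The circle-and-ring construction is easy to reproduce. A minimal sketch (NumPy only; the radii and threshold are illustrative choices, not from the original figure) generates the two classes and checks that the single feature $z = x^2 + y^2$ separates them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Inner circle (class 0) and outer ring (class 1) in 2D:
# not separable by any straight line in the (x, y) plane.
theta = rng.uniform(0, 2 * np.pi, 100)
inner = np.column_stack([0.5 * np.cos(theta), 0.5 * np.sin(theta)])
outer = np.column_stack([2.0 * np.cos(theta), 2.0 * np.sin(theta)])

# The feature z = x^2 + y^2 is the squared radius, so it is
# constant on each circle and differs between the classes.
z_inner = (inner ** 2).sum(axis=1)  # ~0.25 everywhere
z_outer = (outer ** 2).sum(axis=1)  # ~4.0 everywhere

# Any horizontal plane with 0.25 < z < 4.0 separates the classes.
print(z_inner.max(), z_outer.min())
```

This is exactly the mapping the figure describes: a problem that is non-linear in the input space becomes linear in a well-chosen feature space.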
2. Sparse Kernel Machines: SVMs
Support Vector Machines (SVMs) are the most famous sparse kernel machines. They aim to find the decision boundary that maximizes the margin between classes.
The Objective:
We want to find the weight vector $\mathbf{w}$ and bias $b$ that maximize the margin $1/\|\mathbf{w}\|$, subject to:
$t_n \left( \mathbf{w}^\top \phi(\mathbf{x}_n) + b \right) \ge 1$ for all $n$, where $t_n \in \{-1, +1\}$ are the class labels.
Support Vectors:
Unlike other models, the SVM solution only depends on a small subset of the training points—those that lie exactly on the margin. These are called Support Vectors. This leads to a "sparse" solution.
[Figure, "Support Vector Machine (Maximum Margin)": the SVM finds the decision boundary (solid line) that maximizes the margin, i.e. the distance to the nearest data points, called support vectors; shaded areas represent the margin.]
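The sparsity is directly visible in practice. A short sketch (assuming scikit-learn is installed; the blob data is synthetic and illustrative) fits a linear SVM to 100 well-separated points and counts how many of them survive as support vectors:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is available

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs, 50 points each.
X = np.vstack([rng.normal(-2.0, 0.5, (50, 2)),
               rng.normal(+2.0, 0.5, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1e3).fit(X, y)

# Only the points lying on (or violating) the margin become
# support vectors; all other training points could be deleted
# without changing the fitted decision boundary.
print(len(clf.support_vectors_), "of", len(X), "points are support vectors")
```

Prediction cost therefore scales with the number of support vectors, not with the size of the training set, which is what makes the SVM a *sparse* kernel machine.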
3. Soft Margins and Slack Variables
In real-world data, classes are rarely perfectly separable. We introduce slack variables $\xi_n \ge 0$ to allow some points to be on the wrong side of the margin (or boundary), penalized by a parameter $C$: we minimize $C \sum_n \xi_n + \frac{1}{2}\|\mathbf{w}\|^2$ subject to $t_n\left(\mathbf{w}^\top \phi(\mathbf{x}_n) + b\right) \ge 1 - \xi_n$.
- Large $C$: narrow margin, few misclassifications tolerated (High Variance).
- Small $C$: wide margin, more misclassifications allowed (High Bias).
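The effect of the penalty parameter can be observed by counting support vectors at the two extremes. This sketch (again assuming scikit-learn; the overlapping-blob data and the two C values are illustrative) fits the same linear SVM with a very small and a very large penalty:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is available

rng = np.random.default_rng(1)

# Overlapping classes: no hyperplane separates them perfectly,
# so some slack is unavoidable.
X = np.vstack([rng.normal(-1.0, 1.0, (100, 2)),
               rng.normal(+1.0, 1.0, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

counts = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Small C: wide soft margin, so many points fall inside it
    # and become support vectors. Large C: narrow margin, fewer
    # support vectors, but each training error costs more.
    counts[C] = int(clf.n_support_.sum())

print(counts)  # small C should yield noticeably more support vectors
```

In practice $C$ is chosen by cross-validation, trading the wide-margin/high-bias regime against the narrow-margin/high-variance one described above.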
Relevance Vector Machines (RVMs): While SVMs are sparse but non-probabilistic, RVMs provide a Bayesian alternative that is typically even sparser and yields predictive uncertainty, though they are computationally more expensive to train.