Multiclass Classification

Binary classification targets one of two labels, $0$ or $1$. When a dataset contains $K$ distinct classes (where $K > 2$), the formulation has to be extended. This setting is known as multiclass classification.

Visualizing Multiclass (One-vs-All)

The notes illustrate the One-vs-All (also called One-vs-Rest) approach with a scatter plot containing three target classes, each drawn with its own marker: triangles ($\Delta$), circles ($o$), and crosses ($x$).

Instead of drawing a single separating line that handles every class at once, the algorithm learns to isolate one class at a time. For instance, it fits a decision boundary that separates all the triangles (assigned to Class $1$) from everything else, treating the circles and crosses together as a single group (Class $0$). The same procedure is then repeated with each of the other classes taking its turn as Class $1$.
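The relabeling step described above can be sketched in a few lines. This is a minimal illustration rather than the notes' own code: the helper name `one_vs_rest_labels` and the six sample labels are assumptions made up for the example.

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel multiclass targets as binary: 1 for positive_class, 0 for the rest."""
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical shape labels for six points in the scatter plot.
shapes = ["triangle", "circle", "cross", "triangle", "cross", "circle"]

# One binary problem per class: each class takes a turn as Class 1.
binary_targets = {
    shape: one_vs_rest_labels(shapes, shape)
    for shape in sorted(set(shapes))
}

for shape, targets in binary_targets.items():
    print(shape, targets)
```

Each list in `binary_targets` could then be used to train an ordinary binary classifier for that class.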

One-Hot Encoding Target Variables

To represent the classes algebraically without implying an arbitrary numeric ordering or magnitude between them, the scalar target label is replaced by a one-hot encoded vector. Instead of assigning $y = 2$, membership in class 2 out of 3 is represented by a column vector with a $1$ in the second position and $0$ elsewhere:

$$\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}$$
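As a small sketch of this encoding, the function below builds the vector for a given class. The name `one_hot` and the choice of 1-based class indices (to match "class 2 out of 3" in the text) are assumptions for illustration.

```python
def one_hot(class_index, num_classes):
    """Return a one-hot list for a 1-based class index."""
    if not 1 <= class_index <= num_classes:
        raise ValueError("class_index must be in 1..num_classes")
    vector = [0] * num_classes
    vector[class_index - 1] = 1  # shift to Python's 0-based indexing
    return vector


print(one_hot(2, 3))  # [0, 1, 0]
```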

In a dataset with features $x_1, \dots, x_n$ and model parameters $\theta_0, \theta_1, \dots, \theta_n$, each class gets its own set of parameters, and therefore its own decision boundary.


The Softmax Function

To generalize the sigmoid function to more than two classes, we use the softmax function. Softmax maps any real-valued vector to a vector of positive values that sum to exactly $1.0$, so the result can be read as a probability distribution over the classes.

The hypothesis computes the probability that input $x$ belongs to class $i$:

$$h^{(i)}(x) = \frac{e^{\theta_i^T x}}{\sum_{k=1}^K e^{\theta_k^T x}}$$
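The formula above can be sketched in plain Python as follows. The parameter values and the input are hypothetical, and subtracting the maximum score before exponentiating is a common numerical-stability trick that is not part of the notes; it does not change the result.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    # Subtracting the maximum leaves the output unchanged but avoids overflow in exp.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hypothesis(thetas, x):
    """Compute h^{(i)}(x) for every class i, given one parameter vector per class."""
    scores = [sum(t_j * x_j for t_j, x_j in zip(theta, x)) for theta in thetas]
    return softmax(scores)


# Hypothetical parameters for K = 3 classes; x carries a leading 1 for the bias term.
thetas = [[0.0, 1.0, -1.0], [0.5, -0.5, 0.5], [-0.5, 0.0, 1.0]]
x = [1.0, 2.0, 1.0]

probs = hypothesis(thetas, x)
print(probs, sum(probs))
```

The class with the largest entry in `probs` would be the predicted class.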

Categorical Cross-Entropy Loss

Because the parameters for all $K$ classes are optimized jointly, binary cross-entropy generalizes to categorical cross-entropy.

Given a one-hot ground-truth vector $y$ and the predicted probabilities $\hat{y} = h(x)$, the loss for a single example is:

$$\text{Loss} = - \sum_{k=1}^K y_k \log \hat{y}_k$$
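A minimal sketch of this loss for one example is shown below. The function name, the `eps` guard against `log(0)`, and the two sample predictions are assumptions added for illustration.

```python
import math


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Return -sum_k y_k * log(y_hat_k) for a single example."""
    # eps guards against log(0); it is a convenience, not part of the formula.
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))


y = [0, 1, 0]                 # one-hot target: the true class is class 2
confident = [0.1, 0.8, 0.1]   # hypothetical prediction favoring the true class
unsure = [0.4, 0.3, 0.3]      # hypothetical prediction favoring a wrong class

loss_confident = categorical_cross_entropy(y, confident)
loss_unsure = categorical_cross_entropy(y, unsure)
print(loss_confident, loss_unsure)
```

The prediction that puts more probability on the true class receives the smaller loss.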

Minimizing this loss with gradient descent pushes the model to assign as much probability as possible to the correct class. Because $y$ is one-hot, only the term for the true class survives the sum, so the loss reduces to $-\log \hat{y}_c$ for the true class $c$.
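To illustrate this effect, here is a rough sketch of repeated gradient-descent updates on a single example. It relies on the standard result that, for softmax combined with cross-entropy, the gradient with respect to $\theta_k$ is $(\hat{y}_k - y_k)\,x$; the notes do not derive this, and the learning rate, starting parameters, and data below are all hypothetical.

```python
import math


def softmax(scores):
    shift = max(scores)  # stability shift; does not change the result
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(thetas, x):
    return softmax([sum(t * xi for t, xi in zip(theta, x)) for theta in thetas])


def gradient_step(thetas, x, y, learning_rate=0.5):
    """One gradient-descent update on a single example (x, one-hot y)."""
    y_hat = predict(thetas, x)
    # For softmax with cross-entropy, dLoss/dtheta_k = (y_hat_k - y_k) * x.
    return [
        [t - learning_rate * (p - target) * xi for t, xi in zip(theta, x)]
        for theta, p, target in zip(thetas, y_hat, y)
    ]


# Hypothetical setup: K = 3 classes, all-zero starting parameters.
thetas = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
x = [1.0, 2.0]
y = [0, 1, 0]

before = predict(thetas, x)[1]
for _ in range(20):
    thetas = gradient_step(thetas, x, y)
after = predict(thetas, x)[1]
print(before, after)
```

The probability assigned to the true class starts at one third and rises with each update, which is the behavior the loss is designed to encourage.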