Machine Learning
Probabilistic Discriminative Models
Log-Linear Models

Log-Linear Models (Softmax Regression)

Log-linear models are a generalization of logistic regression to handle multiclass classification. Instead of a single sigmoid, we use the Softmax function.


1. The Softmax Function

Given $K$ possible classes, the probability that an input $\mathbf{x}$ belongs to class $k$ is:

$$p(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_{j=1}^K \exp(a_j)}$$

where $a_k = \mathbf{w}_k^T \mathbf{x} + b_k$ is the linear predictor for class $k$.
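The softmax formula above can be sketched in a few lines of NumPy. The max-subtraction trick is a standard numerical-stability measure; it leaves the result unchanged because softmax is invariant to shifting all logits by a constant. The example logits are illustrative.

```python
import numpy as np

def softmax(a):
    """Map logits a_k to probabilities p(C_k | x).

    Subtracting max(a) before exponentiating avoids overflow
    and does not change the output (shift invariance).
    """
    z = a - np.max(a)
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits a_k = w_k^T x + b_k for K = 3 classes
a = np.array([2.0, 1.0, 0.1])
p = softmax(a)
print(p)        # a probability distribution over the 3 classes
print(p.sum())  # sums to 1
```

Note that the largest logit always receives the largest probability, since softmax is monotone in each $a_k$.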

Softmax Probabilities and Temperature

The Softmax function converts raw scores (logits) into a probability distribution. Temperature (T) controls how 'sharp' or 'smooth' the distribution is.

[Interactive demo: temperature slider, default T = 1.0. High T gives a smooth (near-uniform) distribution; low T gives a sharp (peaked) one.]

2. Training: Cross-Entropy Loss

The multiclass cross-entropy loss function is:

$$E(\mathbf{w}_1, \dots, \mathbf{w}_K) = -\sum_{n=1}^N \sum_{k=1}^K t_{nk} \ln y_{nk}$$

Where:

  • $t_{nk}$ is the indicator variable (1 if the $n$-th sample belongs to class $k$, 0 otherwise).
  • $y_{nk} = p(C_k \mid \mathbf{x}_n)$ is the predicted probability.

This is the standard loss function for most modern neural network classifiers.
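The double sum in the loss is easy to express with one-hot target and probability matrices. This is a minimal sketch, assuming one-hot targets `T` and predicted probabilities `Y` of shape (N, K); the small `eps` guards against `log(0)` and the concrete numbers are illustrative.

```python
import numpy as np

def cross_entropy(T, Y, eps=1e-12):
    """Multiclass cross-entropy: E = -sum_n sum_k t_nk * ln(y_nk).

    T: one-hot targets, shape (N, K)
    Y: predicted probabilities, shape (N, K)
    eps guards against ln(0) when a probability underflows to 0.
    """
    return -np.sum(T * np.log(Y + eps))

# Two samples, three classes (illustrative values)
T = np.array([[1, 0, 0],
              [0, 0, 1]])
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.3, 0.6]])
print(cross_entropy(T, Y))  # only the true-class terms survive: -(ln 0.7 + ln 0.6)
```

Because $t_{nk}$ is one-hot, each sample contributes only $-\ln y_{nk}$ for its true class, so the loss is minimized by pushing the true-class probability toward 1.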


3. Key Properties

  • Sum to One: $\sum_{k=1}^K p(C_k \mid \mathbf{x}) = 1$.
  • Probabilities: $0 \le p(C_k \mid \mathbf{x}) \le 1$.
  • Decision Rule: We assign the sample to the class with the highest probability.
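The decision rule can be sketched directly on the logits: since softmax is monotone in $a_k$, the class with the highest probability is the class with the highest linear predictor, so no normalization is needed at prediction time. The weights and inputs below are hypothetical.

```python
import numpy as np

def predict(W, b, X):
    """Assign each sample to the class with the highest probability.

    Softmax is monotone in the logits, so argmax over a_k = w_k^T x + b_k
    gives the same answer as argmax over p(C_k | x).
    """
    A = X @ W.T + b              # logits, shape (N, K)
    return np.argmax(A, axis=1)  # predicted class index per sample

# Hypothetical weights for K = 3 classes, D = 2 features
W = np.array([[ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0, -1.0]])
b = np.zeros(3)
X = np.array([[ 2.0,  0.1],
              [-1.0, -2.0]])
print(predict(W, b, X))  # [0 2]
```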

Softmax and Temperature: Sometimes we divide the linear predictors $a_k$ by a temperature parameter $T$: $p(C_k \mid \mathbf{x}) = \frac{\exp(a_k/T)}{\sum_j \exp(a_j/T)}$.

  • High TT makes the distribution more uniform (higher uncertainty).
  • Low TT makes the distribution more peaked (higher confidence).
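The two temperature effects above can be checked numerically with a minimal sketch; the logits are the same illustrative values as before.

```python
import numpy as np

def softmax_temp(a, T=1.0):
    """Temperature-scaled softmax: p_k = exp(a_k/T) / sum_j exp(a_j/T)."""
    z = a / T
    z -= np.max(z)  # numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])
print(softmax_temp(a, T=0.5))  # low T: sharper, more mass on the top class
print(softmax_temp(a, T=1.0))  # standard softmax
print(softmax_temp(a, T=5.0))  # high T: smoother, closer to uniform
```

As $T \to 0$ the distribution approaches a one-hot vector on the argmax class; as $T \to \infty$ it approaches the uniform distribution $1/K$.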