Machine Learning
Probabilistic Discriminative Models
Logistic Regression

Logistic Regression

Despite its name, Logistic Regression is a classification algorithm. Unlike generative models, which model the class-conditional distributions $p(x \mid y)$, logistic regression models the posterior probability $P(y \mid x)$ directly.


1. The Probabilistic Model

Logistic Regression takes a discriminative approach. We model a binary outcome $y \in \{0, 1\}$ as a Bernoulli distribution:

$$y \sim \text{Bernoulli}(\sigma(\theta^T x))$$

where $\sigma(a)$ is the sigmoid (logistic) function: $\sigma(a) = \frac{1}{1 + \exp(-a)}$. We write $h(x) = \sigma(\theta^T x)$ for the model's predicted probability.

The probability for a single sample is: $P(y \mid x; \theta) = h(x)^y \, (1 - h(x))^{1 - y}$
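A minimal sketch of this model in NumPy (the helper names `sigmoid` and `bernoulli_prob` and the toy values are mine, not from the notes):

```python
import numpy as np

def sigmoid(a):
    # Logistic function: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def bernoulli_prob(y, x, theta):
    # P(y | x; theta) = h(x)^y * (1 - h(x))^(1 - y), with h(x) = sigma(theta^T x)
    h = sigmoid(theta @ x)
    return h**y * (1.0 - h)**(1.0 - y)

theta = np.array([2.0, -1.0])
x = np.array([1.0, 0.5])          # first component plays the role of a bias feature
print(bernoulli_prob(1, x, theta))  # probability the model assigns to y = 1
```

Note that the two cases sum to one, as they must for a Bernoulli distribution: `bernoulli_prob(1, x, theta) + bernoulli_prob(0, x, theta) == 1`.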

Interactive Logistic Sigmoid

(Interactive demo: adjust the weight $w$ and the bias $b$ to see how the decision boundary ($p = 0.5$) shifts and how the model's certainty changes. The weight controls the steepness of the curve; the bias shifts it left or right.)

2. Training: Maximum Likelihood

Given a dataset $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$, the likelihood is the product of the individual probabilities (assuming i.i.d. samples):

L(θ;D)=i=1mP(y(i)x(i);θ)=i=1mh(x(i))y(i)(1h(x(i)))1y(i)L(\theta; D) = \prod_{i=1}^m P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^m h(x^{(i)})^{y^{(i)}} (1 - h(x^{(i)}))^{1 - y^{(i)}}

Log-Likelihood Derivation

To find the optimal $\theta$, we maximize the Log-Likelihood:

logL(θ;D)=i=1mlog[h(x(i))y(i)(1h(x(i)))1y(i)]\log L(\theta; D) = \sum_{i=1}^m \log \left[ h(x^{(i)})^{y^{(i)}} (1 - h(x^{(i)}))^{1 - y^{(i)}} \right]

logL(θ;D)=i=1m[y(i)logh(x(i))+(1y(i))log(1h(x(i)))]\log L(\theta; D) = \sum_{i=1}^m \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right]

In practice, we often minimize the Negative Log-Likelihood (NLL), which (averaged over the $m$ samples) is the Binary Cross-Entropy Loss:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h(x^{(i)})\right) \right]$$
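The loss $J(\theta)$ above can be sketched directly in NumPy. This is my own illustration, not code from the notes; the `eps` clipping is a standard numerical safeguard against $\log 0$:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce_loss(theta, X, y, eps=1e-12):
    # Negative log-likelihood averaged over m samples (binary cross-entropy).
    h = sigmoid(X @ theta)
    h = np.clip(h, eps, 1.0 - eps)  # avoid log(0) for saturated predictions
    return -np.mean(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])  # first column = bias feature
y = np.array([0.0, 0.0, 1.0])
print(bce_loss(np.zeros(2), X, y))  # theta = 0 gives h = 0.5 everywhere, so loss = log 2
```

A useful sanity check: at $\theta = 0$ every prediction is $0.5$, so the loss is exactly $\log 2 \approx 0.693$, independent of the data.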


3. Decision Boundary

The decision boundary is the set of points where both classes are equally probable: $P(y = 1 \mid x) = 0.5 \iff \theta^T x = 0$

This results in a linear decision boundary.

Goal: We find the parameters $\theta$ that maximize the likelihood of our observed data. There is no closed-form solution, so we use iterative optimization such as Gradient Descent.
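Putting the pieces together, batch gradient descent on the NLL can be sketched as follows. The gradient $\nabla J(\theta) = \frac{1}{m} X^T (h - y)$ follows from differentiating $J(\theta)$ with $h = \sigma(X\theta)$; the toy dataset, learning rate, and iteration count are my own choices for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, lr=0.1, n_iters=5000):
    # Batch gradient descent on the NLL; the gradient is (1/m) * X^T (h - y).
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)
        theta -= lr * (X.T @ (h - y)) / m
    return theta

# Toy 1-D data with a bias column; the classes are separated around x = 1.5.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, y)

# The boundary theta^T x = 0 gives x = -theta[0] / theta[1] in the original feature,
# which should land near 1.5, the midpoint between the two classes.
print(-theta[0] / theta[1])
```

On this separable toy set the fitted boundary sits between the two classes and the model classifies all four points correctly, matching the linear-boundary picture above.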