Logistic Regression

Despite its name, Logistic Regression is a classification algorithm. Unlike generative models, which model the class-conditional distributions, logistic regression directly models the posterior probability of the classes.


1. The Probabilistic Model

Logistic Regression takes a discriminative approach. We model a binary outcome $y \in \{0, 1\}$ as a Bernoulli random variable:

$$y \sim \text{Bernoulli}(y \mid \sigma(\theta^T x))$$

Here $\sigma(a)$ is the sigmoid (logistic) function, $\sigma(a) = \frac{1}{1 + \exp(-a)}$, and $h(x) = \sigma(\theta^T x)$ denotes the model's output for input $x$.

The probability for a single sample is: $P(y \mid x; \theta) = h(x)^y (1 - h(x))^{1 - y}$

The Sigmoid Graph

The Sigmoid function creates a characteristic "S-curve". As drawn in the notes:

  • The x-axis represents the linear combination $z = \theta^T x$.
  • The y-axis represents the probability output (0 to 1).
  • The curve asymptotically approaches $0$ as $z \to -\infty$ and $1$ as $z \to +\infty$.
  • It crosses $0.5$ exactly at $z = 0$, which sets the decision threshold.
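The properties listed above can be checked numerically. The following is a minimal sketch, not taken from the notes; the function name `sigmoid` and the two-branch form (used to avoid overflow for large negative inputs) are choices made for this illustration.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, written so that exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(z) is at most 1, so this form is safe.
    ez = math.exp(z)
    return ez / (1.0 + ez)

# The output approaches 0 on the left, 1 on the right, and is 0.5 at z = 0.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z = {z:6.1f}  sigmoid(z) = {sigmoid(z):.5f}")
```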

[Interactive figure: Logistic Sigmoid. Adjusting the weight w (default 1.0) changes the steepness of the curve, i.e. the model's certainty; adjusting the bias b (default 0.0) shifts the curve left or right, moving the decision boundary at p = 0.5.]
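The effect of the weight and bias can also be reproduced without the interactive widget. The sketch below assumes a one-feature model $p = \sigma(w x + b)$; the helper names `predict_proba` and `boundary` are hypothetical and chosen only for this example. For such a model the probability equals 0.5 where $w x + b = 0$, that is at $x = -b / w$.

```python
import math

def predict_proba(x: float, w: float, b: float) -> float:
    """P(y = 1 | x) for a one-feature model, sigma(w * x + b).

    Uses the plain formula, which is fine for the small inputs shown here.
    """
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def boundary(w: float, b: float) -> float:
    """Input x at which the predicted probability is 0.5 (requires w != 0)."""
    return -b / w

for w, b in [(1.0, 0.0), (5.0, 0.0), (1.0, -2.0)]:
    x0 = boundary(w, b)
    # Probability half a unit to the right of the boundary:
    # a larger weight pushes it closer to 1 (a steeper curve).
    p = predict_proba(x0 + 0.5, w, b)
    print(f"w = {w}, b = {b}: boundary at x = {x0:.2f}, p(x0 + 0.5) = {p:.3f}")
```

A larger weight makes the model more certain for the same distance from the boundary, while the bias only relocates the boundary.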

2. Parameter Estimation & Maximum Likelihood

In Machine Learning, finding the weights that best fit the data is known as Parameter Estimation, and the technique used here is Maximum Likelihood Estimation (MLE).

Intuition: The Coin Trial

Consider a coin flip with an unknown probability of landing Heads, parameter $\phi$. Given data from three independent flips, $x_1 = H, x_2 = T, x_3 = T$, we want to find the value of $\phi$ that maximizes the probability (likelihood) of seeing this exact sequence.

  • $P(x_i = H) = \phi$
  • $P(x_i = T) = 1 - \phi$

Since the flips are independent and identically distributed (i.i.d.), the overall probability (likelihood) is the product: $L(\phi) = P(x_1)P(x_2)P(x_3) = \phi(1-\phi)(1-\phi) = \phi(1-\phi)^2$

To optimize, we take the log-likelihood and set its derivative with respect to $\phi$ to zero. The solution is the value of $\phi$ under which the observed data is most probable.
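For the three flips above, the optimization step can be carried out explicitly. The working below is a short sketch of that calculation; the symbol $\hat{\phi}$ for the resulting estimate is introduced here and does not appear in the notes.

```latex
\begin{align*}
\log L(\phi) &= \log \phi + 2 \log(1 - \phi) \\
\frac{d}{d\phi} \log L(\phi) &= \frac{1}{\phi} - \frac{2}{1 - \phi} = 0 \\
1 - \phi &= 2\phi \\
\hat{\phi} &= \frac{1}{3}
\end{align*}
```

The estimate matches the observed frequency: one Head in three flips.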

MLE in Logistic Regression

We apply the same process to model learning. As outlined in the notes, the output $y$ is treated as a random variable whose distribution depends on the features $(x_1, x_2, \dots, x_n)$:

| $x_1, x_2$ | $\dots\ x_n$ | $y$ (random variable) |
|---|---|---|
| $\dots$ | $\dots$ | 0 |
| $\dots$ | $\dots$ | 1 |
| $\dots$ | $\dots$ | 1 |
| $\dots$ | $\dots$ | 0 |

Given this dataset, we treat each row as an independent Bernoulli trial. The likelihood of the entire dataset $D$ is the product of these individual probabilities (assuming i.i.d.):

$$L(\theta; D) = \prod_{i=1}^m P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^m h(x^{(i)})^{y^{(i)}} (1 - h(x^{(i)}))^{1 - y^{(i)}}$$

Log-Likelihood Derivation

To find the optimal $\theta$, we maximize the log-likelihood:

$$\log L(\theta; D) = \sum_{i=1}^m \log \left[ h(x^{(i)})^{y^{(i)}} (1 - h(x^{(i)}))^{1 - y^{(i)}} \right]$$

$$\log L(\theta; D) = \sum_{i=1}^m \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right]$$

In practice, we often minimize the Negative Log-Likelihood (NLL) instead. Averaged over the $m$ samples, it is the Binary Cross-Entropy Loss: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right]$
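The loss formula can be evaluated directly on a handful of labels and predicted probabilities. This is a minimal sketch under stated assumptions: the function name `binary_cross_entropy`, the clipping constant `eps`, and the example probabilities are all made up for illustration and are not from the notes.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-likelihood J for labels in {0, 1} and predictions h(x)."""
    total = 0.0
    for y, h in zip(y_true, y_prob):
        # Clip h away from 0 and 1 so log() never receives 0
        # (eps is an assumed safeguard, not part of the formula).
        h = min(max(h, eps), 1.0 - eps)
        total += y * math.log(h) + (1 - y) * math.log(1.0 - h)
    return -total / len(y_true)

labels = [0, 1, 1, 0]
good = [0.1, 0.9, 0.8, 0.2]   # probabilities that agree with the labels
bad = [0.9, 0.1, 0.2, 0.8]    # probabilities that contradict the labels

print("loss (good predictions):", binary_cross_entropy(labels, good))
print("loss (bad predictions): ", binary_cross_entropy(labels, bad))
```

Predictions that agree with the labels give a lower loss, which is the same as giving the observed data a higher likelihood.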

Visualizing the Cross-Entropy Loss

The NLL is exactly this empirical loss. The notes plot the per-sample loss as two mirrored curves, one for each label:

  • If $y=1$, the active loss is $-\log(h(x))$. As the prediction $h(x)$ decreases toward $0$, the penalty grows without bound.
  • If $y=0$, the active loss is $-\log(1-h(x))$. As the prediction incorrectly approaches $1$, the penalty grows without bound.

This heavily penalizes the model for being confidently wrong.
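To get a feel for how quickly the penalty grows, the per-sample loss can be tabulated for a true label of 1 as the prediction moves from confidently right to confidently wrong. The helper `sample_loss` is a hypothetical name used only in this sketch.

```python
import math

def sample_loss(y: int, h: float) -> float:
    """Per-sample cross-entropy: -log(h) if y == 1, else -log(1 - h)."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# True label is 1; h is the predicted probability of class 1.
for h in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"h = {h:4.2f}  loss = {sample_loss(1, h):.3f}")
```

The two cases are mirror images: the loss for $y=0$ at prediction $h$ equals the loss for $y=1$ at prediction $1 - h$.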


3. Decision Boundary

The decision boundary is the set of points where the probability of both classes is equal: $P(y=1 \mid x) = 0.5 \iff \theta^T x = 0$

Because $\theta^T x = 0$ is linear in $x$, the decision boundary is linear (a hyperplane in feature space).
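In code, classifying a point only requires the sign of $\theta^T x$, since $h(x) \geq 0.5$ exactly when $\theta^T x \geq 0$. The sketch below uses made-up parameters and assumes the first feature is a constant 1 acting as a bias term; neither is taken from the notes.

```python
def predict(theta, x):
    """Return 1 if theta^T x >= 0 (equivalently h(x) >= 0.5), else 0."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# Hypothetical parameters; x[0] = 1 is an assumed bias feature.
# The boundary theta^T x = 0 is then the line x1 + x2 = 3.
theta = [-3.0, 1.0, 1.0]

print(predict(theta, [1.0, 2.0, 2.0]))  # x1 + x2 = 4, on the positive side
print(predict(theta, [1.0, 0.5, 1.0]))  # x1 + x2 = 1.5, on the negative side
```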

Goal: We find the parameters $\theta$ that maximize the likelihood of our observed data. There is no closed-form solution, so we use an iterative optimizer such as Gradient Descent.

Gradient Descent & The Chain Rule

To iteratively update our parameters, we need the partial derivative of the cost function $J(\theta)$ with respect to each weight $\theta_j$. As highlighted in the notes, computing it requires the Chain Rule, with $h = \sigma(z)$ and $z = \theta^T x$:

$$\frac{\partial J}{\partial \theta_j} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$$
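Each factor in the chain rule can be written out for a single sample. The working below is a sketch; the symbol $\ell$ for the per-sample loss $-[y \log h + (1 - y) \log(1 - h)]$ is introduced here for convenience, with $h = \sigma(z)$ and $z = \theta^T x$.

```latex
\begin{align*}
\frac{\partial \ell}{\partial h} &= -\frac{y}{h} + \frac{1 - y}{1 - h} = \frac{h - y}{h(1 - h)} \\
\frac{\partial h}{\partial z} &= h(1 - h) \\
\frac{\partial z}{\partial \theta_j} &= x_j \\
\frac{\partial \ell}{\partial \theta_j} &= (h - y)\, x_j \\
\frac{\partial J}{\partial \theta_j} &= \frac{1}{m} \sum_{i=1}^m \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\end{align*}
```

The $h(1 - h)$ terms cancel, leaving a gradient proportional to the prediction error times the feature value.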

This yields the gradient descent update rule, where $\alpha$ is the learning rate: $\theta_j \leftarrow \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$
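The update rule can be turned into a small training loop. The following is a minimal sketch of batch gradient descent, assuming the standard gradient of the cross-entropy loss, $\frac{1}{m} \sum_i (h(x^{(i)}) - y^{(i)}) x_j^{(i)}$. The function name `fit_logistic`, the learning rate, the step count, and the toy data are all illustrative choices rather than values from the notes.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written so that exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(X, y, alpha=0.1, steps=5000):
    """Batch gradient descent on the mean cross-entropy J(theta)."""
    m, n = len(X), len(X[0])
    theta = [0.0] * n
    for _ in range(steps):
        # Prediction errors h(x_i) - y_i under the current theta.
        errors = [sigmoid(sum(t * v for t, v in zip(theta, row))) - yi
                  for row, yi in zip(X, y)]
        # dJ/dtheta_j = (1/m) * sum_i (h(x_i) - y_i) * x_ij
        grad = [sum(e * row[j] for e, row in zip(errors, X)) / m
                for j in range(n)]
        # Simultaneous update of every theta_j.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
    return theta

# Toy data, made up for illustration; the first column is a constant bias feature.
X = [[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]]
y = [0, 0, 1, 1]

theta = fit_logistic(X, y)
probs = [sigmoid(sum(t * v for t, v in zip(theta, row))) for row in X]
print("theta:", theta)
print("predicted probabilities:", [round(p, 3) for p in probs])
```

Because this toy data is perfectly separable, the weights keep growing the longer the loop runs; a fixed step count is used here simply to stop it.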