Logistic Regression

Despite its name, logistic regression is a classification algorithm. Unlike generative models, which model the class-conditional distributions $P(x \mid y)$, logistic regression directly models the posterior probability of the classes, $P(y \mid x)$.

1. The Probabilistic Model

Logistic regression takes a discriminative approach. We model a binary outcome $y \in \{0, 1\}$ as a Bernoulli random variable whose parameter depends on the input $x$:

$y \mid x \sim \text{Bernoulli}(\sigma(\theta^T x))$

where $\sigma(a) = \frac{1}{1 + \exp(-a)}$ is the sigmoid (logistic) function, and the model's prediction is $h(x) = \sigma(\theta^T x)$.

The probability for a single sample is: $P(y \mid x; \theta) = h(x)^y (1 - h(x))^{1 - y}$

The Sigmoid Graph

The sigmoid function traces a characteristic "S-curve". As drawn in the notes:

The x-axis represents the linear combination $z = \theta^T x$.
The y-axis represents the probability output (from $0$ to $1$).
The curve asymptotically approaches $0$ as $z \to -\infty$ and $1$ as $z \to +\infty$.
It passes through $\sigma(z) = 0.5$ exactly at $z = 0$, which sets the decision threshold.
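The properties listed above can be checked numerically. The following is a minimal sketch; the function name `sigmoid` and the sample values of $z$ are illustrative choices, and the two-branch form is only there to avoid overflow for large negative inputs.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), split by sign to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # safe: z < 0, so exp(z) < 1
    return e / (1.0 + e)

# Values approach 0 on the left, 1 on the right, and equal 0.5 at z = 0.
for z in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"z = {z:6.1f}  sigmoid(z) = {sigmoid(z):.5f}")
```

The curve is also symmetric about the point $(0, 0.5)$, in the sense that $\sigma(-z) = 1 - \sigma(z)$.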

[Interactive figure: logistic sigmoid. The weight $w$ (default 1.0) controls the steepness of the curve, i.e. the model's certainty; the bias $b$ (default 0.0) shifts the curve left or right, moving the $p = 0.5$ decision boundary.]
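For a single feature, the model is $p = \sigma(wx + b)$, so the $p = 0.5$ crossing sits where $wx + b = 0$, that is at $x = -b/w$. The sketch below illustrates both effects; the helper names `predict_proba` and `boundary` and the sample parameter values are hypothetical, chosen only for illustration.

```python
import math

def predict_proba(x, w=1.0, b=0.0):
    """P(y = 1 | x) for a one-feature logistic model with weight w and bias b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def boundary(w, b):
    """Input x at which w*x + b = 0, i.e. where the predicted probability is 0.5."""
    return -b / w

# Changing b moves the crossing point; increasing w makes the same input
# produce a more extreme (more "certain") probability.
for w, b in [(1.0, 1.0), (1.0, -2.0), (4.0, 1.0)]:
    print(f"w = {w}, b = {b}: boundary at x = {boundary(w, b):.2f}, "
          f"p(x = 1) = {predict_proba(1.0, w, b):.3f}")
```

A larger $|w|$ makes the curve steeper around the boundary, since the slope of $\sigma(wx + b)$ at the crossing point is $w/4$.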

2. Parameter Estimation & Maximum Likelihood

In Machine Learning, finding the optimal weights is formally known as Parameter Estimation, and the core technique used is Maximum Likelihood Estimation (MLE).

Intuition: The Coin Trial

Consider a coin flip with an unknown probability of landing Heads, parameter $\phi$. Given data from three independent flips, $x_1 = H, x_2 = T, x_3 = T$, we want to find the value of $\phi$ that maximizes the probability (likelihood) of seeing this exact sequence.

$P(x_i = H) = \phi$
$P(x_i = T) = 1 - \phi$

Since the flips are independent and identically distributed (i.i.d.), the overall probability (likelihood) is the product: $L(\phi) = P(x_1)P(x_2)P(x_3) = \phi(1-\phi)(1-\phi) = \phi(1-\phi)^2$

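To optimize, we take the log-likelihood, $\log L(\phi) = \log \phi + 2 \log(1 - \phi)$, and set its derivative to zero: $\frac{1}{\phi} - \frac{2}{1 - \phi} = 0$, which gives $\hat{\phi} = \frac{1}{3}$. This is the value of $\phi$ that makes the observed sequence most probable, and it matches the observed fraction of heads (one in three).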
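As a quick numerical check of the coin example, the sketch below scans a grid of candidate values for $\phi$ and keeps the one with the highest log-likelihood. The grid resolution and the helper name `log_likelihood` are arbitrary illustrative choices; for a Bernoulli sample the closed-form maximizer is the fraction of heads, here $1/3$.

```python
import math

flips = ["H", "T", "T"]  # the three observed flips from the example

def log_likelihood(phi, flips):
    """log L(phi) = (#heads) * log(phi) + (#tails) * log(1 - phi)."""
    heads = flips.count("H")
    tails = flips.count("T")
    return heads * math.log(phi) + tails * math.log(1.0 - phi)

# Candidate values strictly between 0 and 1, so both logarithms are defined.
grid = [i / 1000 for i in range(1, 1000)]
phi_hat = max(grid, key=lambda phi: log_likelihood(phi, flips))

print(f"grid maximizer: {phi_hat:.3f}")
print(f"fraction of heads: {flips.count('H') / len(flips):.3f}")
```

The two printed values should agree up to the grid spacing.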

MLE in Logistic Regression

We apply the same process to model learning. As outlined in the notes, the output $y$ is treated as a random variable whose distribution depends on the features $(x_1, x_2, \dots, x_n)$:

$x_1, x_2$	$\dots, x_n$	$y$ (random variable)
$\dots$	$\dots$	$0$
$\dots$	$\dots$	$1$
$\dots$	$\dots$	$1$
$\dots$	$\dots$	$0$

Given this dataset, we treat each row as an independent Bernoulli trial. Under the i.i.d. assumption, the likelihood of the entire dataset $D$ is the product of the per-sample probabilities:

$L(\theta; D) = \prod_{i=1}^m P(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^m h(x^{(i)})^{y^{(i)}} (1 - h(x^{(i)}))^{1 - y^{(i)}}$

Log-Likelihood Derivation

To find the optimal $\theta$, we maximize the log-likelihood:

$\log L(\theta; D) = \sum_{i=1}^m \log \left[ h(x^{(i)})^{y^{(i)}} (1 - h(x^{(i)}))^{1 - y^{(i)}} \right]$

$\log L(\theta; D) = \sum_{i=1}^m \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right]$

In practice, we minimize the average negative log-likelihood (NLL) rather than maximize the log-likelihood. The two problems have the same solution, and the resulting objective is the binary cross-entropy loss: $J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log(1 - h(x^{(i)})) \right]$
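The loss $J(\theta)$ can be computed directly from its definition. The sketch below is one possible implementation under stated assumptions: the toy data are made up, each row of `X` carries a leading $1$ so that the first entry of `theta` acts as a bias (a convention not stated in the notes), and predictions are clipped by a small `eps` so the logarithm never receives exactly $0$.

```python
import math

def sigmoid(z):
    """Logistic function, split by sign to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def h(theta, x):
    """Model prediction h(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * x_j for t, x_j in zip(theta, x)))

def binary_cross_entropy(theta, X, y, eps=1e-12):
    """Average negative log-likelihood J(theta) over the dataset."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = min(max(h(theta, x_i), eps), 1.0 - eps)  # keep log() finite
        total += y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return -total / len(y)

# Made-up toy data; the first column is the constant 1 (bias term, an assumption).
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]

print(binary_cross_entropy([0.0, 0.0], X, y))  # every h(x) is 0.5, so J = log 2
print(binary_cross_entropy([0.0, 2.0], X, y))  # weights that fit the toy labels
```

With all parameters at zero, every prediction is $0.5$ and the loss equals $\log 2 \approx 0.693$, a useful baseline when checking an implementation.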

Visualizing the Cross-Entropy Loss

For a single sample, only one of the two terms in the loss is active, depending on the label. Plotted against the prediction $h(x)$, the two cases give mirror-image curves:

If $y=1$, the active loss is $-\log(h(x))$. As the prediction $h(x)$ decreases toward $0$, the penalty grows without bound.
If $y=0$, the active loss is $-\log(1-h(x))$. As the prediction incorrectly approaches $1$, the penalty again grows without bound.

This ensures the model is heavily penalized for being confidently wrong.
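A few sample values make the shape of the penalty concrete. This is a small sketch; `sample_loss` is an illustrative helper name and the chosen predictions are arbitrary.

```python
import math

def sample_loss(y, h):
    """Per-sample cross-entropy: -log(h) if y == 1, else -log(1 - h)."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

# True label y = 1: the loss is small for confident-correct predictions
# and grows rapidly as the prediction moves toward 0.
for h in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"y = 1, h(x) = {h:4.2f}  loss = {sample_loss(1, h):.3f}")
```

For example, predicting $h(x) = 0.01$ when $y = 1$ costs $-\log(0.01) \approx 4.605$, compared with $\log 2 \approx 0.693$ for an uncommitted prediction of $0.5$.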

3. Decision Boundary

The decision boundary is the set of points where the probability of both classes is equal: $P(y=1 \mid x) = 0.5 \iff \theta^T x = 0$

This results in a linear decision boundary.
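Because $\sigma(z) \geq 0.5$ exactly when $z \geq 0$, classification only needs the sign of $\theta^T x$. The sketch below shows this; the parameter values are hypothetical, and the convention that each input starts with a constant $1$ (so `theta[0]` is a bias) is an assumption not stated in the notes.

```python
def score(theta, x):
    """Linear score theta^T x."""
    return sum(t * x_j for t, x_j in zip(theta, x))

def predict(theta, x):
    """Predict 1 when theta^T x >= 0, which is equivalent to sigmoid(theta^T x) >= 0.5."""
    return 1 if score(theta, x) >= 0 else 0

# Hypothetical parameters: bias -1, weights 2 and 1.
# The boundary is the line -1 + 2*x1 + x2 = 0 in the (x1, x2) plane.
theta = [-1.0, 2.0, 1.0]

print(predict(theta, [1.0, 1.0, 1.0]))  # score = 2, on the positive side
print(predict(theta, [1.0, 0.0, 0.0]))  # score = -1, on the negative side
print(score(theta, [1.0, 0.5, 0.0]))    # score = 0, a point on the boundary
```

With two features the boundary is a line; with more features it is a hyperplane.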

Goal: We find the parameters $\theta$ that maximize the likelihood of our observed data. There is no closed-form solution, so we use an iterative optimization method such as gradient descent.

Gradient Descent & The Chain Rule

To iteratively update our parameters, we need the partial derivative of the cost function $J(\theta)$ with respect to each weight $\theta_j$. Because $J$ depends on $\theta_j$ only through the prediction $h = \sigma(z)$ and the linear score $z = \theta^T x$, the derivative follows from the chain rule, as highlighted graphically in the notes:

$\frac{\partial J}{\partial \theta_j} = \frac{\partial J}{\partial h} \cdot \frac{\partial h}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$
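Working through the three factors for a single sample, with $h = \sigma(z)$, $z = \theta^T x$ and per-sample loss $\ell = -\left[ y \log h + (1 - y) \log(1 - h) \right]$ (the symbol $\ell$ is introduced here only for this derivation):

$\frac{\partial \ell}{\partial h} = -\frac{y}{h} + \frac{1 - y}{1 - h} = \frac{h - y}{h(1 - h)}$

$\frac{\partial h}{\partial z} = \sigma(z)(1 - \sigma(z)) = h(1 - h)$

$\frac{\partial z}{\partial \theta_j} = x_j$

The factor $h(1 - h)$ cancels in the product, leaving $\frac{\partial \ell}{\partial \theta_j} = (h - y) x_j$. Averaging over the $m$ samples gives:

$\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=1}^m \left( h(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$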

Substituting this gradient into the gradient descent update rule, with learning rate $\alpha$, gives: $\theta_j \leftarrow \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}$
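The update rule can be turned into a short training loop. The sketch below is one possible batch gradient descent implementation, not the notes' own code: it uses the standard result $\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_i (h(x^{(i)}) - y^{(i)}) x_j^{(i)}$ for the cross-entropy loss, and the toy data, the leading constant-$1$ feature, the learning rate `alpha` and the step count are all illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic function, split by sign to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(theta, x):
    """h(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * x_j for t, x_j in zip(theta, x)))

def gradient(theta, X, y):
    """Gradient of J: (1/m) * sum_i (h(x_i) - y_i) * x_ij for each j."""
    m = len(y)
    grad = [0.0] * len(theta)
    for x_i, y_i in zip(X, y):
        error = predict_proba(theta, x_i) - y_i
        for j, x_ij in enumerate(x_i):
            grad[j] += error * x_ij / m
    return grad

def fit(X, y, alpha=0.5, steps=5000):
    """Batch gradient descent starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = gradient(theta, X, y)
        theta = [t - alpha * g for t, g in zip(theta, grad)]  # theta_j <- theta_j - alpha * dJ/dtheta_j
    return theta

# Made-up, non-separable toy data; first column is the constant 1 (bias term).
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.0], [1.0, 2.5], [1.0, 3.0]]
y = [0, 0, 1, 0, 1, 1]

theta = fit(X, y)
print("theta:", theta)
print("P(y=1 | x=0.5):", predict_proba(theta, [1.0, 0.5]))
print("P(y=1 | x=3.0):", predict_proba(theta, [1.0, 3.0]))
```

The toy labels overlap on purpose: on perfectly separable data the likelihood has no finite maximizer, so the weights would keep growing with every step.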

Perceptron Log-Linear Models