Logistic Regression
Despite its name, Logistic Regression is a classification algorithm. Unlike generative models, which model the class-conditional distributions, logistic regression directly models the posterior probability of the classes.
1. The Probabilistic Model
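Logistic Regression takes a discriminative approach. We model the binary outcome $y \in \{0, 1\}$ as a Bernoulli random variable whose parameter depends on the features $x$ through weights $w$ and a bias $b$:
$$P(y = 1 \mid x) = \sigma(w^\top x + b)$$
Where $\sigma$ is the Sigmoid (Logistic) Function:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
The probability for a single sample is:
$$P(y \mid x) = \sigma(z)^{y}\,\bigl(1 - \sigma(z)\bigr)^{1 - y}, \qquad z = w^\top x + b$$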
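A minimal Python sketch of this model: a sigmoid applied to a linear combination of the features, then read as a Bernoulli probability. The function names and the numeric values of `w`, `b`, and `x` are illustrative assumptions, not taken from the notes.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), branched to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(x, w, b):
    """P(y = 1 | x) for one sample x (a list of feature values)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

def bernoulli_prob(y, p):
    """Probability of the observed label y in {0, 1} when P(y = 1) = p."""
    return p ** y * (1.0 - p) ** (1 - y)

# Hypothetical values, chosen only for illustration.
w, b = [2.0, -1.0], 0.5
x = [1.0, 3.0]
p = predict_proba(x, w, b)  # linear score z = 2 - 3 + 0.5 = -0.5
print(p, bernoulli_prob(1, p), bernoulli_prob(0, p))
```

The two Bernoulli probabilities always sum to one, which is what lets a single output describe both classes.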
The Sigmoid Graph
The Sigmoid function creates a characteristic "S-curve". As drawn in the notes:
- The x-axis represents the linear combination $z = w^\top x + b$.
- The y-axis represents the probability output $\sigma(z)$.
- The curve asymptotically approaches $1$ as $z \to +\infty$ and $0$ as $z \to -\infty$.
- It crosses $\sigma(z) = 0.5$ exactly at $z = 0$, establishing the decision threshold.
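The properties listed above can be checked numerically. The sketch below also shows, for a one-feature model, how the bias shifts the point where $p = 0.5$ and how the weight magnitude controls how quickly the model becomes certain. The helper `boundary_1d` and the `(w, b)` pairs are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Asymptotes and midpoint of the S-curve.
for z in (-10.0, 0.0, 10.0):
    print(f"z = {z:6.1f}  sigmoid(z) = {sigmoid(z):.5f}")

def boundary_1d(w, b):
    """Input x at which w * x + b = 0, i.e. where p = 0.5 (requires w != 0)."""
    return -b / w

# Hypothetical (w, b) pairs: b shifts the boundary, |w| sets the steepness.
for w, b in [(1.0, 0.0), (1.0, -2.0), (5.0, -10.0)]:
    x0 = boundary_1d(w, b)
    p_near = sigmoid(w * (x0 + 0.5) + b)  # probability half a unit past the boundary
    print(f"w = {w}, b = {b}: boundary at x = {x0}, p(x0 + 0.5) = {p_near:.3f}")
```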
2. Parameter Estimation & Maximum Likelihood
In Machine Learning, finding the optimal weights is formally known as Parameter Estimation, and the core technique used is Maximum Likelihood Estimation (MLE).
Intuition: The Coin Trial
Consider a coin with an unknown probability $\theta$ of landing Heads. Given data from three independent flips, $D = \{x_1, x_2, x_3\}$ with $x_i = 1$ for Heads and $x_i = 0$ for Tails, we want to find the value of $\theta$ that maximizes the probability (likelihood) of seeing this exact sequence.
Since the flips are independent and identically distributed (i.i.d.), the likelihood is the product:
$$L(\theta) = \prod_{i=1}^{3} \theta^{x_i} (1 - \theta)^{1 - x_i}$$
To optimize, we take the log-likelihood $\log L(\theta)$ and set its derivative with respect to $\theta$ to zero. The solution is the parameter value under which the observed data is most probable.
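As a worked illustration, assume the three flips were Heads, Heads, Tails (a hypothetical sequence; the notes' actual data is not reproduced here). Setting the derivative of the log-likelihood to zero gives the fraction of Heads, and a simple grid scan should land on the same value. A sketch:

```python
import math

def log_likelihood(theta, flips):
    """Log-likelihood of i.i.d. Bernoulli flips (1 = Heads, 0 = Tails)."""
    return sum(x * math.log(theta) + (1 - x) * math.log(1.0 - theta) for x in flips)

# Hypothetical data: Heads, Heads, Tails.
flips = [1, 1, 0]

# Closed form: d/dtheta [h log(theta) + t log(1 - theta)] = 0  =>  theta = h / (h + t)
theta_closed = sum(flips) / len(flips)

# Numerical check: scan candidate values strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, flips))

print(theta_closed, theta_grid)
```

The grid scan is only a sanity check; for the coin the closed form is exact, whereas logistic regression has no such closed form.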
MLE in Logistic Regression
We apply the same process to model learning. As outlined in the notes, the output variable $y$ is treated as a random variable conditioned on the underlying features $x$:
| Random Variable | ||
|---|---|---|
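Given this dataset formulation, we treat each row $(x_i, y_i)$ as an independent Bernoulli trial with predicted probability $\hat{p}_i = \sigma(w^\top x_i + b)$. The likelihood of the entire dataset of $N$ rows is the product of these individual probabilities (assuming i.i.d.):
$$L(w, b) = \prod_{i=1}^{N} \hat{p}_i^{\,y_i} (1 - \hat{p}_i)^{1 - y_i}$$
Log-Likelihood Derivation
To find the optimal $w$ and $b$, we maximize the Log-Likelihood, which turns the product into a sum:
$$\ell(w, b) = \log L(w, b) = \sum_{i=1}^{N} \bigl[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \bigr]$$
In practice, we often minimize the Negative Log-Likelihood (NLL) instead. Averaged over the $N$ samples, it is the Binary Cross-Entropy Loss:
$$J(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \bigl[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \bigr]$$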
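The relationship between the likelihood, the log-likelihood, and the cross-entropy loss can be checked on a few numbers. This is a sketch with made-up labels and predictions; the `eps` clipping is a common numerical safeguard and an assumption here, not something the notes specify.

```python
import math

def log_likelihood(y_true, p_pred):
    """Sum of per-sample Bernoulli log-probabilities."""
    return sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
               for y, p in zip(y_true, p_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean negative log-likelihood; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Hypothetical labels and predicted probabilities.
y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.2, 0.6, 0.4]

ll = log_likelihood(y_true, p_pred)
bce = binary_cross_entropy(y_true, p_pred)
likelihood = math.exp(ll)  # equals the product 0.9 * 0.8 * 0.6 * 0.4
print(ll, bce, likelihood)
```

Because the logarithm is monotonic and the factor $1/N$ is a positive constant, maximizing the likelihood and minimizing the cross-entropy select the same parameters.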
Visualizing the Cross-Entropy Loss
For a single sample, the NLL reduces to one of two terms depending on the true label $y$, where $\hat{p}$ is the predicted probability of class 1. The notes plot these as two mirrored curves:
- If $y = 1$, the active loss is $-\log(\hat{p})$. As the prediction decreases toward $0$, the penalty grows toward infinity.
- If $y = 0$, the active loss is $-\log(1 - \hat{p})$. As the prediction incorrectly approaches $1$, the penalty grows toward infinity.
This ensures the model is heavily penalized for being confidently wrong.
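To make the shape of the two curves concrete, the sketch below evaluates the per-sample loss at a few hand-picked predictions (the probability values are arbitrary illustrations):

```python
import math

def sample_loss(y, p):
    """Cross-entropy loss for one sample: -log(p) if y == 1, else -log(1 - p)."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)

# True label y = 1: the loss grows without bound as the prediction approaches 0.
for p in (0.99, 0.5, 0.1, 0.01, 0.001):
    print(f"y = 1, p = {p:<6} loss = {sample_loss(1, p):.3f}")

# True label y = 0: the mirror image, the loss grows as the prediction approaches 1.
for p in (0.01, 0.5, 0.9, 0.99, 0.999):
    print(f"y = 0, p = {p:<6} loss = {sample_loss(0, p):.3f}")
```

Each tenfold drop in the probability assigned to the true class adds $\log 10 \approx 2.3$ to the loss, so the penalty is unbounded while the reward for a correct, confident prediction flattens out near zero.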
3. Decision Boundary
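The decision boundary is the set of points where the probability of both classes is equal:
$$P(y = 1 \mid x) = P(y = 0 \mid x) = 0.5 \iff \sigma(w^\top x + b) = 0.5 \iff w^\top x + b = 0$$
Because $w^\top x + b = 0$ defines a hyperplane in feature space, this results in a linear decision boundary.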
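Since the sigmoid is monotonic, thresholding the probability at 0.5 is the same as checking the sign of the linear score. A small sketch with a hypothetical two-feature model whose boundary is the line $x_1 + x_2 - 3 = 0$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(x, w, b):
    return sum(wj * xj for wj, xj in zip(w, x)) + b

def predict_label(x, w, b):
    """Class 1 when w.x + b >= 0, which is the same test as sigmoid(w.x + b) >= 0.5."""
    return 1 if linear_score(x, w, b) >= 0.0 else 0

# Hypothetical 2-D model: the boundary is the line x1 + x2 - 3 = 0.
w, b = [1.0, 1.0], -3.0
points = [[1.0, 1.0], [1.0, 2.0], [2.5, 2.5]]
for x in points:
    z = linear_score(x, w, b)
    print(x, z, sigmoid(z), predict_label(x, w, b))
```

Assigning points exactly on the boundary to class 1 is a convention chosen here; either side is defensible.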
Goal: We find the parameters that maximize the likelihood of our observed data. There is no closed-form solution, so we use iterative optimization like Gradient Descent.
Gradient Descent & The Chain Rule
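To iteratively update our parameters, we need the partial derivative of the cost function $J = \frac{1}{N}\sum_{i=1}^{N} J_i$ (the mean cross-entropy, with per-sample loss $J_i$) with respect to each weight $w_j$. As highlighted graphically in the notes, this requires applying the Chain Rule through the prediction $\hat{p}_i = \sigma(z_i)$ and the linear score $z_i = w^\top x_i + b$:
$$\frac{\partial J_i}{\partial w_j} = \frac{\partial J_i}{\partial \hat{p}_i} \cdot \frac{\partial \hat{p}_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_j} = \frac{\hat{p}_i - y_i}{\hat{p}_i (1 - \hat{p}_i)} \cdot \hat{p}_i (1 - \hat{p}_i) \cdot x_{ij} = (\hat{p}_i - y_i)\, x_{ij}$$
Averaging over the samples and stepping against the gradient with learning rate $\alpha$ yields the gradient descent update rule, repeated until convergence:
$$w_j \leftarrow w_j - \alpha \, \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i)\, x_{ij}$$
The bias follows the same rule with $x_{ij}$ replaced by $1$.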
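The update rule can be sketched as plain batch gradient descent. Everything below is an illustrative assumption rather than the notes' implementation: the toy dataset, the learning rate `lr`, the step count, and the zero initialization. The data is deliberately not linearly separable so that the weights stay finite.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def bce(X, y, w, b):
    """Mean binary cross-entropy over the dataset."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, w, b)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(X)

def gradients(X, y, w, b):
    """dJ/dw_j = mean((p_i - y_i) * x_ij), dJ/db = mean(p_i - y_i)."""
    n = len(X)
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for xi, yi in zip(X, y):
        err = predict_proba(xi, w, b) - yi
        for j, xij in enumerate(xi):
            grad_w[j] += err * xij / n
        grad_b += err / n
    return grad_w, grad_b

def fit(X, y, lr=0.1, steps=500):
    """Plain batch gradient descent starting from zero weights."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(steps):
        grad_w, grad_b = gradients(X, y, w, b)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Hypothetical toy data with one feature and overlapping classes.
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]]
y = [0, 0, 1, 0, 1, 1]

loss_before = bce(X, y, [0.0], 0.0)
w, b = fit(X, y)
loss_after = bce(X, y, w, b)
print(loss_before, loss_after, w, b)
```

A useful habit when implementing the gradient by hand is to compare it against a finite-difference estimate of the loss before trusting the training loop.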