Machine Learning
Issues in ML
Overfitting

Overfitting and Underfitting

The ultimate goal of any machine learning model is generalization: the ability to perform well on new, unseen data. Two common pitfalls prevent this: Overfitting and Underfitting.


1. Underfitting (High Bias)

Underfitting occurs when a model is too simple to capture the underlying structure of the data.

  • Symptom: High error on both training and test data.
  • Cause: The model makes strong, incorrect assumptions (e.g., trying to fit a straight line to data that is clearly curved).
  • Solution: Increase model complexity, use more features, or reduce regularization.

2. Overfitting (High Variance)

Overfitting occurs when a model is so complex that it starts "memorizing" the noise in the training data rather than the actual signal.

  • Symptom: Very low error on training data, but high error on test data.
  • Cause: The model is too flexible and has too many parameters relative to the amount of data (e.g., using a 10th-degree polynomial to fit 5 points).
  • Solution: Use more training data, simplify the model (feature selection), or apply regularization.

The Goldilocks Zone

We want a model that is "just right"—complex enough to capture the trend, but simple enough to ignore the random noise.

Underfitting (High Bias)

Model is too simple to capture the underlying trend.

Balanced (Good Fit)

Model captures the general trend without being distracted by noise.

Overfitting (High Variance)

Model fits the training noise perfectly but misses the true trend.


Signals of Overfitting

Training ErrorTest ErrorStatus
HighHighUnderfitting (Model is too simple)
LowHighOverfitting (Model is memorizing noise)
LowLowGood Fit (Model generalizes well)

Example: Polynomial Degree vs. Validation Error

Suppose we are fitting a model with different polynomial degrees. We measure the average validation error across several runs:

Degree (dd)Validation ErrorStatus
110.5Underfitting (too simple)
28.7Improving fit
35.6Optimal (Captures the true trend)
48.6Beginning to overfit
59.7Severe Overfitting (memorizing noise)

As we increase the degree, the model's flexibility grows. However, after degree 3, the validation error starts to rise because the model begins to capture noise rather than signal.

The Bias-Variance Tradeoff: The generalization error of a model can be mathematically decomposed as: Err(x)=Bias2(x)+Var(x)+σ2Err(x) = \text{Bias}^2(x) + \text{Var}(x) + \sigma^2 (where σ2\sigma^2 is irreducible error)

  • Bias is the error from erroneous assumptions in the learning algorithm. High bias leads to underfitting.
  • Variance is the error from sensitivity to small fluctuations in the training set. High variance leads to overfitting.

We strive to minimize the sum of both. As we increase model complexity, Bias decreases but Variance increases. This intrinsic inverse relationship requires us to locate the optimal minimum on the generalization error curve.