Model Selection

Model selection is the process of choosing the best model from a set of candidate models. This involves balancing model complexity to achieve the best generalization on new data.


1. The Validation Strategy

We cannot use the Test Set to choose our model, as that would "leak" information and lead to over-optimistic results. Instead, we split the available data into three parts:

  1. Training Set (e.g., 70-80%): Used to fit the parameters ($\boldsymbol{\theta}$) of the model.
  2. Validation Set (e.g., 10-20%): Used to tune Hyperparameters (e.g., learning rate $\alpha$, polynomial degree $d$) and choose the best model architecture.
  3. Test Set (e.g., 10%): Used ONLY once at the very end to estimate real-world performance on unseen data.
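
The three-way split above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a prescribed method: the function name `train_val_test_split`, the default fractions (15% validation, 10% test), and the fixed seed are all assumptions chosen to fall inside the ranges listed above.

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.10, seed=0):
    """Shuffle `data` and split it into train, validation, and test lists.

    The fractions are illustrative defaults, not recommendations.
    """
    if not 0.0 <= val_frac + test_frac < 1.0:
        raise ValueError("val_frac + test_frac must be in [0, 1)")
    items = list(data)
    random.Random(seed).shuffle(items)  # shuffle a copy, reproducibly
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]      # everything left over
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 75 15 10
```

The test list is set aside first so that nothing done later with the training and validation lists can touch it.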

2. Hyperparameter Tuning

Hyperparameters are settings that are chosen before the learning process begins. They are not learned by Gradient Descent. We use the Validation Set to find their optimal values.

Example: Selecting the Learning Rate $\alpha$

| Model | Learning Rate ($\alpha$) | Training Error | Validation Error | Status |
|---|---|---|---|---|
| Model 1 | 0.1 | High (213) | High (217) | Too High |
| Model 2 | 0.01 | Low (218) | Low (210) | Optimal |
| Model 3 | 0.001 | Low (310) | Low (311) | Too Slow |

Notice that Model 2 is the best choice because it has the lowest validation error, even though its training error is slightly higher than that of Model 1.
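
To make the selection procedure concrete, here is a small sketch that trains the same model with three learning rates and keeps the one with the lowest validation error. Everything specific in it is an assumption for illustration: the toy data (a noisy line, y = 2x + 1), the 100 gradient steps, and the helper names `make_data`, `fit`, and `mse`. The numbers it prints are unrelated to the table above.

```python
import math
import random

def make_data(n, rng):
    """Toy data: y = 2x + 1 plus Gaussian noise (assumed for this demo)."""
    xs = [rng.uniform(-8.0, 8.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def mse(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b."""
    total = 0.0
    for x, y in zip(xs, ys):
        err = w * x + b - y
        total += err * err
    return total / len(xs)

def fit(xs, ys, alpha, steps=100):
    """Batch gradient descent on the MSE; alpha is the hyperparameter."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errs = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 * sum(e * x for e, x in zip(errs, xs)) / n
        grad_b = 2.0 * sum(errs) / n
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

rng = random.Random(0)
x_train, y_train = make_data(70, rng)
x_val, y_val = make_data(30, rng)

val_errors = {}
for alpha in (0.1, 0.01, 0.001):
    w, b = fit(x_train, y_train, alpha)   # parameters come from training data
    err = mse(w, b, x_val, y_val)         # alpha is judged on validation data
    val_errors[alpha] = err if math.isfinite(err) else float("inf")

best_alpha = min(val_errors, key=val_errors.get)
for alpha, err in val_errors.items():
    print(f"alpha={alpha}: validation MSE = {err:.4g}")
print("selected alpha:", best_alpha)
```

With these settings the largest rate overshoots and its error blows up, while the smallest rate has not finished converging after 100 steps, which mirrors the "Too High" and "Too Slow" rows of the table.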


3. k-Fold Cross-Validation

When data is scarce, we use k-Fold Cross-Validation:

  • Split training data into $k$ "folds".
  • Train $k$ times, each time using $k-1$ folds for training and 1 fold for validation.
  • Average the validation performance across all $k$ runs.

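k-Fold Cross Validation (k=5)

| Iteration | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 |
|---|---|---|---|---|---|
| Iter 1 | Valid | Train | Train | Train | Train |
| Iter 2 | Train | Valid | Train | Train | Train |
| Iter 3 | Train | Train | Valid | Train | Train |
| Iter 4 | Train | Train | Train | Valid | Train |
| Iter 5 | Train | Train | Train | Train | Valid |

Legend: Train = training fold, Valid = validation fold.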

The dataset is split into 5 equal folds. In each iteration, one fold is used for validation and the other 4 folds are used for training. The final score is the average across all 5 iterations.
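
The procedure can be sketched in plain Python as follows. This is one possible implementation under stated assumptions: the model is a simple least-squares line, the score is mean squared error, and the helper names (`k_fold_indices`, `fit_line`, `cross_val_mse`) and the toy data are hypothetical choices made for this example.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Closed-form least squares for y = w*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    w = sxy / sxx
    return w, my - w * mx

def mse(model, xs, ys):
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def cross_val_mse(xs, ys, k=5):
    """Return (mean validation MSE, list of per-fold validation MSEs)."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i; validate on fold i.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit_line([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(mse(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return sum(scores) / k, scores

# Toy data (assumed): a noisy line y = 3x - 2.
rng = random.Random(1)
xs = [rng.uniform(0.0, 10.0) for _ in range(50)]
ys = [3.0 * x - 2.0 + rng.gauss(0.0, 1.0) for x in xs]

mean_score, fold_scores = cross_val_mse(xs, ys, k=5)
print("per-fold MSE:", [round(s, 3) for s in fold_scores])
print("mean MSE:", round(mean_score, 3))
```

Every example is used for validation exactly once and for training k-1 times, which is why this approach suits small datasets.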


4. Visualizing the Selection Curve

The optimal model is located at the point where the validation error is at its minimum.

The Model Selection Curve

As model complexity increases, training error keeps decreasing. Validation error, our estimate of the generalization error, initially decreases and then starts to increase once overfitting begins.
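
The numbers behind such a curve can be produced with a short experiment. The sketch below uses polynomial degree $d$ as the complexity knob and prints training and validation error for each degree. The target function sin(3x), the noise level, the sample sizes, and the degree range 1 to 8 are assumptions chosen for illustration, and the tiny linear solver is written out only to keep the example free of dependencies.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x

def fit_poly(xs, ys, d):
    """Least-squares polynomial of degree d via the normal equations."""
    A = [[sum(x ** (i + j) for x in xs) for j in range(d + 1)] for i in range(d + 1)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(d + 1)]
    return solve(A, b)

def mse(coef, xs, ys):
    def predict(x):
        return sum(c * x ** i for i, c in enumerate(coef))
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def sample(n, rng):
    """Toy data (assumed): y = sin(3x) plus Gaussian noise, x in [-1, 1]."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [math.sin(3.0 * x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = sample(30, rng)
x_val, y_val = sample(30, rng)

degrees = range(1, 9)
train_err, val_err = {}, {}
for d in degrees:
    coef = fit_poly(x_train, y_train, d)
    train_err[d] = mse(coef, x_train, y_train)
    val_err[d] = mse(coef, x_val, y_val)

best_d = min(val_err, key=val_err.get)  # the minimum of the validation curve
for d in degrees:
    marker = "  <- lowest validation error" if d == best_d else ""
    print(f"d={d}  train={train_err[d]:.4f}  val={val_err[d]:.4f}{marker}")
```

Plotting the two columns against $d$ gives the selection curve. The exact shape depends on the data and the seed, so the location of the minimum here should not be read as a general result.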

Early Stopping: In iterative algorithms such as neural network training, we can monitor the validation error during training and stop once it begins to rise, keeping the parameters from the point of lowest validation error. This effectively "selects" the best version of the model automatically.
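
A minimal sketch of this loop is shown below. The `patience` parameter (how many consecutive non-improving checks to tolerate) is an addition not mentioned above; setting `patience=1` stops as soon as the validation error rises. The one-parameter "model" is a deliberately artificial stand-in for overfitting: training pulls the parameter toward 3.0 while validation error is lowest at 2.0. All names here are hypothetical.

```python
def train_with_early_stopping(params, train_step, val_error,
                              max_steps=100, patience=3):
    """Run train_step repeatedly; return the params with the lowest validation error.

    Stops after `patience` consecutive checks without improvement.
    """
    best_params, best_err, best_step = params, val_error(params), 0
    bad_checks = 0
    for step in range(1, max_steps + 1):
        params = train_step(params)
        err = val_error(params)
        if err < best_err:
            best_params, best_err, best_step = params, err, step
            bad_checks = 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break  # validation error has stopped improving
    return best_params, best_err, best_step

# Toy setup (assumed): training loss is (theta - 3)^2, validation loss is (theta - 2)^2.
ALPHA = 0.1

def train_step(theta):
    grad = 2.0 * (theta - 3.0)       # gradient of the training loss
    return theta - ALPHA * grad

def val_error(theta):
    return (theta - 2.0) ** 2

best_theta, best_err, best_step = train_with_early_stopping(0.0, train_step, val_error)
print(best_step, best_theta, best_err)  # best step is 5 in this toy setup
```

Without early stopping the parameter would keep moving toward 3.0 and the validation error would keep growing. The loop instead returns the snapshot taken at the validation minimum.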