
Model Selection

Model selection is the process of choosing the best model from a set of candidate models. This involves trading off goodness of fit against model complexity to achieve the best generalization on new data.


1. The Validation Set

We cannot use the Test Set to choose our model, as that would "leak" information and lead to over-optimistic results. Instead, we split the training data further:

  1. Training Set: Used to fit the parameters of the model.
  2. Validation Set: Used to tune hyperparameters and choose the best model architecture.
  3. Test Set: Used ONLY once at the very end to estimate real-world performance.
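The three-way split above can be sketched as a small helper. This is a minimal illustration, not a specific library's API; the function name and the 60/20/20 default fractions are choices made here for the example:

```python
import random

def three_way_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data once, then carve off test and validation sets.

    Whatever remains after removing test and validation is the training set.
    """
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed for reproducibility
    n = len(data)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(range(100))
```

The key point is that the shuffle and split happen once, before any model is trained, so the test set is never touched during tuning.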

2. Cross-Validation

When data is scarce, we use k-Fold Cross-Validation:

  • Split the training data into k "folds".
  • Train k times, each time using k-1 folds for training and the remaining fold for validation.
  • Average the validation performance across all k runs.
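The steps above can be sketched in plain Python. The `fit` and `score` callables are hypothetical placeholders standing in for whatever model-fitting and evaluation routines you use (libraries such as scikit-learn provide equivalents):

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    # Distribute the n points as evenly as possible across k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    indices = list(range(n))
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]            # the held-out fold
        train_idx = indices[:start] + indices[start + size:]  # the other k-1 folds
        yield train_idx, val_idx
        start += size

def cross_val_score(fit, score, X, y, k=5):
    """Average the validation score of `fit` across all k folds."""
    scores = []
    for tr, va in k_fold_indices(len(X), k):
        model = fit([X[i] for i in tr], [y[i] for i in tr])
        scores.append(score(model, [X[i] for i in va], [y[i] for i in va]))
    return sum(scores) / k
```

Every point serves as validation data exactly once, which is what makes k-fold CV attractive when data is scarce.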

3. Information Criteria

Sometimes we prefer models that are simpler, even if they have slightly higher training error.

  • AIC (Akaike Information Criterion): Rewards goodness of fit but penalizes the number of parameters.
  • BIC (Bayesian Information Criterion): Similar to AIC but with a stronger penalty for the number of parameters.
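Both criteria can be written down directly from their standard formulas, AIC = 2p - 2 ln L and BIC = p ln n - 2 ln L, where L is the maximized likelihood, p the number of parameters, and n the sample size (lower is better for both):

```python
import math

def aic(log_likelihood, n_params):
    """AIC = 2p - 2 ln L: rewards fit, penalizes parameter count."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    """BIC = p ln n - 2 ln L: like AIC, but the penalty grows with n."""
    return n_params * math.log(n_samples) - 2 * log_likelihood
```

Note that for any sample size n > e^2 (about 7.4 samples), ln n > 2, so BIC's per-parameter penalty exceeds AIC's, which is why BIC tends to pick simpler models.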

4. Visualizing the Selection Curve

The optimal model is located at the point where the validation error is at its minimum.

The Model Selection Curve

As model complexity increases, training error decreases monotonically. However, validation error (our proxy for generalization error) initially decreases, then starts to increase once overfitting begins.

💡 Early Stopping: In iterative algorithms like Neural Networks, we can monitor the validation error during training and stop as soon as it begins to rise, effectively "selecting" the best version of the model automatically.
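A common refinement is to wait a few epochs ("patience") before stopping, since validation error can be noisy. The sketch below assumes hypothetical `step` and `val_error` callables for one training update and the current validation error; in practice these would wrap your actual training loop:

```python
def train_with_early_stopping(step, val_error, max_epochs=100, patience=3):
    """Run training, stopping once validation error stops improving.

    `step(epoch)` performs one training update (hypothetical callable).
    `val_error(epoch)` returns the current validation error (hypothetical).
    Returns the epoch and error of the best model seen.
    """
    best_err, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        step(epoch)
        err = val_error(epoch)
        if err < best_err:
            best_err, best_epoch, waited = err, epoch, 0  # new best: reset patience
        else:
            waited += 1
            if waited >= patience:  # no improvement for `patience` epochs
                break
    return best_epoch, best_err
```

In a real training loop you would also save the model's weights whenever a new best validation error is reached, so the "selected" model can be restored after stopping.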