Model Selection
Model selection is the process of choosing the best model from a set of candidate models. This involves balancing model complexity to achieve the best generalization on new data.
1. The Validation Strategy
We cannot use the Test Set to choose our model, as that would "leak" information and lead to over-optimistic results. Instead, we split the training data further:
- Training Set (e.g., 70-80%): Used to fit the parameters of the model.
- Validation Set (e.g., 10-20%): Used to tune hyperparameters (e.g., learning rate, polynomial degree) and choose the best model architecture.
- Test Set (e.g., 10%): Used ONLY once at the very end to estimate real-world performance on unseen data.
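The three-way split above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `train_val_test_split`, the 75/15/10 proportions, and the fixed seed are assumptions made for the example, not part of the original text.

```python
import random

def train_val_test_split(rows, val_frac=0.15, test_frac=0.10, seed=0):
    """Shuffle rows once, then cut them into train / validation / test lists."""
    rows = list(rows)                    # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)    # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]                 # held back until the very end
    val = rows[n_test:n_test + n_val]    # used for tuning and model choice
    train = rows[n_test + n_val:]        # everything else fits the parameters
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters when the data is stored in some meaningful order; for time-ordered data a chronological cut is usually more appropriate than a random one.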
2. Hyperparameter Tuning
Hyperparameters are settings that are chosen before the learning process begins. They are not learned by Gradient Descent. We use the Validation Set to find their optimal values.
Example: Selecting the Learning Rate
| Model | Learning Rate | Training Error | Validation Error | Status |
|---|---|---|---|---|
| Model 1 | 0.1 | High (213) | High (217) | Too High |
| Model 2 | 0.01 | Low (218) | Low (210) | Optimal |
| Model 3 | 0.001 | Low (310) | Low (311) | Too Slow |
Model 2 is the best choice because it has the lowest validation error, even though its training error is slightly higher than Model 1's. Validation error, not training error, is the selection criterion.
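The same selection procedure can be sketched in code: train one model per candidate learning rate, then keep the one with the lowest validation error. The example below is a toy setup, assuming a one-variable linear model fitted by batch gradient descent on synthetic data; the helper names (`fit_gd`, `mse`), the data, and the step count are all hypothetical. Which learning rate wins on this toy problem need not match the table above.

```python
import random

def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def fit_gd(train, lr, steps=200):
    """Fit w and b by batch gradient descent with a fixed learning rate."""
    w, b = 0.0, 0.0
    n = len(train)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in train) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in train) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic data (an assumption for this sketch): y = 3x + 1 plus noise.
rng = random.Random(0)
data = []
for _ in range(60):
    x = rng.uniform(-1, 1)
    data.append((x, 3.0 * x + 1.0 + rng.gauss(0, 0.1)))
train, val = data[:45], data[45:]

# One model per candidate learning rate; record (training, validation) error.
results = {}
for lr in (0.1, 0.01, 0.001):
    w, b = fit_gd(train, lr)
    results[lr] = (mse(w, b, train), mse(w, b, val))

# Select on validation error only.
best_lr = min(results, key=lambda lr: results[lr][1])
for lr, (train_err, val_err) in results.items():
    print(f"lr={lr}: train={train_err:.4f} val={val_err:.4f}")
print("selected learning rate:", best_lr)
```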
3. k-Fold Cross-Validation
When data is scarce, we use k-Fold Cross-Validation:
- Split the training data into k "folds".
- Train k times, each time using k-1 folds for training and 1 fold for validation.
- Average the validation performance across all runs.
k-Fold Cross Validation (k=5)
The dataset is split into 5 equal folds. In each iteration, one fold is used for validation and the other 4 folds are used for training. The final score is the average across all 5 iterations.
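The procedure can be sketched as follows. This is a minimal standard-library illustration, not a reference implementation; `k_fold_splits` and `cross_val_score` are hypothetical helper names, and the "model" in the usage example is deliberately trivial (the mean of the training targets) so that the fold logic stays in focus.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each index is validated exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k folds of nearly equal size
    for i in range(k):
        train_idx = [j for f in range(k) if f != i for j in folds[f]]
        yield train_idx, folds[i]

def cross_val_score(data, k, fit, score):
    """Average the validation score over the k train/validation runs."""
    scores = []
    for train_idx, val_idx in k_fold_splits(len(data), k):
        model = fit([data[j] for j in train_idx])
        scores.append(score(model, [data[j] for j in val_idx]))
    return sum(scores) / len(scores)

# Toy usage: predict every target with the training mean, score by MSE.
rng = random.Random(0)
targets = [rng.gauss(5.0, 1.0) for _ in range(50)]

def fit_mean(train):
    return sum(train) / len(train)

def mean_model_mse(mean, val):
    return sum((y - mean) ** 2 for y in val) / len(val)

avg_val_mse = cross_val_score(targets, k=5, fit=fit_mean, score=mean_model_mse)
print(f"average validation MSE over 5 folds: {avg_val_mse:.3f}")
```

Because every point is used for validation exactly once, the averaged score uses all of the available training data, which is why the method suits small datasets. The cost is k training runs instead of one.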
4. Visualizing the Selection Curve
The optimal model is located at the point where the validation error is at its minimum.
The Model Selection Curve
As model complexity increases, training error keeps decreasing. However, error on held-out data (generalization error) initially decreases, then starts to increase once overfitting begins.
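A curve of this kind can be computed by sweeping a complexity hyperparameter and recording both errors. The sketch below assumes polynomial degree as the complexity measure, noisy samples of sin(3x) as the data, and a small hand-written least-squares solver so that only the standard library is needed; all names and numbers are illustrative choices.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_poly(data, degree):
    """Least-squares polynomial coefficients via the normal equations."""
    m = degree + 1
    A = [[sum(x ** (i + j) for x, _ in data) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in data) for i in range(m)]
    return solve(A, b)

def mse(coeffs, data):
    """Mean squared error of the polynomial on (x, y) pairs."""
    return sum((sum(c * x ** i for i, c in enumerate(coeffs)) - y) ** 2
               for x, y in data) / len(data)

rng = random.Random(0)

def sample(n):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, math.sin(3 * x) + rng.gauss(0, 0.2)) for x in xs]

train, val = sample(20), sample(200)

# Sweep complexity (polynomial degree) and record both error curves.
curve = []
for degree in range(7):
    coeffs = fit_poly(train, degree)
    curve.append((degree, mse(coeffs, train), mse(coeffs, val)))

best_degree = min(curve, key=lambda row: row[2])[0]
for degree, train_err, val_err in curve:
    print(f"degree={degree}: train={train_err:.4f} val={val_err:.4f}")
print("degree with lowest validation error:", best_degree)
```

The normal equations become ill-conditioned for high degrees, so this solver is only suitable for small demonstrations like this one.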
Early Stopping: When a model is trained iteratively, as neural networks are, we can monitor the validation error during training and stop as soon as it begins to rise, effectively "selecting" the best version of the model automatically.
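A minimal sketch of early stopping follows, assuming a degree-9 polynomial model trained by batch gradient descent on a small noisy dataset. The function `train_with_early_stopping`, its defaults, and the data are hypothetical. The loop stops at the first rise in validation error, exactly as described above, and returns the weights from the best step seen so far.

```python
import math
import random

def features(x, degree):
    """Polynomial features 1, x, x^2, ..., x^degree."""
    return [x ** i for i in range(degree + 1)]

def mse(w, rows):
    """Mean squared error on (feature_vector, target) pairs."""
    return sum((sum(wi * fi for wi, fi in zip(w, f)) - y) ** 2
               for f, y in rows) / len(rows)

def train_with_early_stopping(train, val, degree=9, lr=0.05, max_steps=2000):
    tr = [(features(x, degree), y) for x, y in train]
    va = [(features(x, degree), y) for x, y in val]
    w = [0.0] * (degree + 1)
    best_w, best_val, best_step = w[:], mse(w, va), 0
    for step in range(1, max_steps + 1):
        # One batch gradient descent step on the training MSE.
        residuals = [sum(wi * fi for wi, fi in zip(w, f)) - y for f, y in tr]
        grad = [2 * sum(r * f[i] for r, (f, _) in zip(residuals, tr)) / len(tr)
                for i in range(len(w))]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
        # Monitor validation error; stop at the first rise.
        val_err = mse(w, va)
        if val_err > best_val:
            break
        best_w, best_val, best_step = w[:], val_err, step
    return best_w, best_val, best_step

rng = random.Random(1)

def sample(n):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, math.sin(3 * x) + rng.gauss(0, 0.3)) for x in xs]

train, val = sample(15), sample(100)
best_w, best_val, best_step = train_with_early_stopping(train, val)
print(f"kept weights from step {best_step}, validation MSE {best_val:.4f}")
```

If the validation error never rises within `max_steps`, the loop simply runs to the end and returns the final weights.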