Model Selection

Model selection is the process of choosing the best model from a set of candidate models. This involves balancing model complexity, since a model that is too simple underfits and one that is too complex overfits, to achieve the best generalization on new data.

1. The Validation Strategy

We cannot use the Test Set to choose our model, as that would "leak" information about the test data into our choices and lead to over-optimistic results. Instead, we hold out part of the training data as a separate Validation Set, giving three splits:

Training Set (e.g., 70-80%): Used to fit the parameters ($\boldsymbol{\theta}$) of the model.
Validation Set (e.g., 10-20%): Used to tune hyperparameters (e.g., learning rate $\alpha$, polynomial degree $d$) and choose the best model architecture.
Test Set (e.g., 10%): Used ONLY once, at the very end, to estimate real-world performance on unseen data.
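
The three-way split above can be sketched in a few lines of plain Python. This is a minimal illustration, not a prescribed method: the function name `train_val_test_split`, the 75/15/10 proportions, and the fixed seed are all assumptions chosen for the example.

```python
import random

def train_val_test_split(samples, val_frac=0.15, test_frac=0.10, seed=0):
    """Shuffle samples and split them into train, validation, and test lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly

    n = len(shuffled)
    n_test = int(round(n * test_frac))
    n_val = int(round(n * val_frac))
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # everything that is left
    return train, val, test

data = list(range(100))  # placeholder dataset: 100 sample indices
train, val, test = train_val_test_split(data)
print(len(train), len(val), len(test))  # → 75 15 10
```

Shuffling before splitting matters when the data is stored in some order (for example, sorted by label), because otherwise the three sets would not be drawn from the same distribution.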

2. Hyperparameter Tuning

Hyperparameters are settings chosen before the learning process begins. Unlike the parameters $\boldsymbol{\theta}$, they are not learned by Gradient Descent; we use the Validation Set to find good values for them.

Example: Selecting the Learning Rate $\alpha$

Model	Learning Rate ($\alpha$)	Training Error	Validation Error	Status of $\alpha$
Model 1	0.1	High (213)	High (217)	Too high
Model 2	0.01	Low (218)	Low (210)	Optimal
Model 3	0.001	Low (310)	Low (311)	Too low (slow convergence)

Notice that Model 2 is the best choice because it has the lowest validation error (210), even though its training error (218) is slightly higher than Model 1's (213). The selection criterion is the validation error, not the training error.
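
To make the procedure concrete, here is a small sketch that trains the same model with several learning rates and keeps the one with the lowest validation error. Everything in it is an assumption made for illustration: the synthetic data ($y = 3x + 2$ plus noise), the one-variable linear model, the 100 gradient steps, and the helper names `fit_linear_gd` and `mse`. It does not reproduce the numbers in the table above.

```python
import math
import random

def fit_linear_gd(points, alpha, steps=100):
    """Fit y = w*x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= alpha * grad_w
        b -= alpha * grad_b
    return w, b

def mse(points, w, b):
    """Mean squared error; a diverged (non-finite) fit is reported as infinity."""
    value = sum((w * x + b - y) * (w * x + b - y) for x, y in points) / len(points)
    return value if math.isfinite(value) else math.inf

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise.
rng = random.Random(0)
points = [(i * 0.1, 3 * (i * 0.1) + 2 + rng.gauss(0, 0.5)) for i in range(100)]
val_points = points[::5]                                   # every 5th point
train_points = [p for i, p in enumerate(points) if i % 5]  # the remaining 80

results = {}
for alpha in (0.1, 0.01, 0.001):
    w, b = fit_linear_gd(train_points, alpha)
    results[alpha] = (mse(train_points, w, b), mse(val_points, w, b))

# Select by validation error only; training error is just reported.
best_alpha = min(results, key=lambda a: results[a][1])
for alpha, (train_err, val_err) in results.items():
    print(f"alpha={alpha}: train={train_err:.3g}, val={val_err:.3g}")
print("selected alpha:", best_alpha)
```

On this particular data the largest rate overshoots and the error grows instead of shrinking, which is the "too high" failure mode. The test set is deliberately not touched anywhere in this loop.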

3. k-Fold Cross-Validation

When data is scarce, we use k-Fold Cross-Validation:

Split training data into $k$ "folds".
Train $k$ times, each time using $k-1$ folds for training and 1 fold for validation.
Average the validation performance across all $k$ runs.

Figure: k-Fold Cross Validation (k=5)

The dataset is split into 5 equal folds. In each iteration, one fold is used for validation and the other 4 folds are used for training. The final score is the average across all 5 iterations.
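
The three steps can be sketched as follows. This is a minimal, dependency-free illustration: the helper names `k_fold_indices` and `cross_val_score`, the toy "predict the mean" model, and the synthetic data are assumptions for the example, not part of any particular library.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for k roughly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # fold sizes differ by at most 1
    for i in range(k):
        val_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx

def cross_val_score(xs, ys, fit, error, k=5):
    """Average validation error of a model over k folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(error(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return sum(scores) / k

# Toy model (hypothetical): always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def mse(model, xs, ys):
    return sum((model - y) ** 2 for y in ys) / len(ys)

xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
print(cross_val_score(xs, ys, fit_mean, mse, k=5))
```

Every sample is used for validation exactly once and for training $k-1$ times, which is why this approach makes better use of scarce data than a single fixed validation split. To compare candidate models, call `cross_val_score` once per candidate and keep the one with the lowest average.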

4. Visualizing the Selection Curve

The optimal model is located at the point where the validation error is at its minimum.

Figure: The Model Selection Curve

As model complexity increases, training error keeps decreasing. Validation error (our estimate of generalization error), however, initially decreases and then starts to increase once overfitting begins.
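
The shape of this curve can be reproduced with a small experiment. The sketch below swaps in a piecewise-constant (binned) regressor, whose complexity is simply the number of bins, because it needs no external libraries; it is a stand-in for illustration and not the polynomial model mentioned earlier. The data (a noisy sine wave), the sample sizes, and the helper names are likewise assumptions.

```python
import math
import random

def fit_bins(points, n_bins):
    """Piecewise-constant regressor on [0, 1): one mean per bin."""
    global_mean = sum(y for _, y in points) / len(points)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in points:
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Bins that received no training points fall back to the global mean.
    return [s / c if c else global_mean for s, c in zip(sums, counts)]

def mse(points, bin_means):
    n_bins = len(bin_means)
    return sum((bin_means[min(int(x * n_bins), n_bins - 1)] - y) ** 2
               for x, y in points) / len(points)

rng = random.Random(0)

def sample(n):
    """Synthetic data (an assumption): noisy sine wave on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)) for x in xs]

train_points, val_points = sample(40), sample(40)

complexities = [1, 2, 4, 8, 16, 32]  # number of bins = model complexity
curve = []
for n_bins in complexities:
    model = fit_bins(train_points, n_bins)
    curve.append((n_bins, mse(train_points, model), mse(val_points, model)))

for n_bins, train_err, val_err in curve:
    print(f"bins={n_bins:2d}  train={train_err:.3f}  val={val_err:.3f}")

# The selected model sits at the minimum of the validation curve.
best_bins = min(curve, key=lambda row: row[2])[0]
print("selected complexity:", best_bins)
```

Because each bin count doubles the previous one, the partitions are nested and the training error can only go down as bins are added. The validation error has no such guarantee, which is exactly why it, and not the training error, is used to pick the complexity.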

Early Stopping: In iterative algorithms like Neural Networks, we can monitor the validation error during training and stop as soon as it begins to rise, effectively "selecting" the best version of the model automatically.
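
A minimal sketch of the stopping rule, assuming we already have one validation error per epoch. The function name `early_stopping_epoch`, the `patience` parameter (how many non-improving epochs to tolerate before stopping), and the error values are hypothetical and only serve the illustration; `patience=1` corresponds to stopping as soon as the validation error rises.

```python
def early_stopping_epoch(val_errors, patience=1):
    """Return (best_epoch, best_error), stopping after `patience`
    consecutive epochs without improvement in validation error."""
    best_epoch, best_error = None, float("inf")
    epochs_without_improvement = 0
    for epoch, err in enumerate(val_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error has stopped improving
    return best_epoch, best_error

# Hypothetical per-epoch validation errors, for illustration only.
val_errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.60, 0.40]
print(early_stopping_epoch(val_errors))              # → (3, 0.5)
print(early_stopping_epoch(val_errors, patience=3))  # → (6, 0.4)
```

In a real training loop the model's weights would be saved whenever a new best epoch is found, so that the returned epoch can actually be restored. The second call shows the trade-off: a larger patience tolerates temporary rises in validation error at the cost of extra training time.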
