Cross-validation plays an essential role in evaluating machine learning models.
The main purpose of cross-validation is to prevent overfitting and improve a model's ability to generalize.
Overfitting occurs when a model performs very well on the training data but poorly on new, unseen data. Such a model memorizes the training examples instead of learning patterns that generalize.
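To make this concrete, here is a minimal sketch (not from the article) of what overfitting looks like in practice: an unrestricted decision tree fitted on noisy synthetic data memorizes the training set almost perfectly, yet scores noticeably lower on held-out data. The dataset and model choice here are illustrative assumptions.

```python
# Hypothetical sketch of overfitting: a fully grown decision tree
# memorizes the training data but generalizes worse to unseen data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise so memorization cannot generalize
X, y = make_classification(n_samples=500, n_informative=5,
                           flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

# max_depth=None lets the tree grow until it fits the train set exactly
tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)
print(tree.score(X_tr, y_tr))  # near-perfect on the training data
print(tree.score(X_te, y_te))  # noticeably lower on unseen data
```

The gap between the two scores is the symptom cross-validation helps us detect and control.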
We’re often familiar with train-test splits. If we do not have a separate dataset for testing the model, we divide the same dataset into train and test splits.
The train set is used to train the model; the model parameters learn their values from it. The final evaluation is done on the test set, which the model has never seen during training.
Apart from the train-test sets, there is another set which is called the validation set.
The validation set is used to evaluate the model while it is trained multiple times with different combinations of hyperparameters. In other words, it is used for hyperparameter tuning.
Actually, we can split the entire dataset into train, test and validation sets by calling the Scikit-learn train_test_split() function twice!
from sklearn.model_selection import train_test_split

# First split: keep 70% for training, 30% for the remainder
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7, random_state=42)

# Second split: halve the remainder into validation and test sets (15% each)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem, test_size=0.5, random_state=42)
When splitting in this way, the train set (X_train and y_train parts) includes 70% of the instances, while the validation set (X_valid and y_valid parts) and the test set (X_test and y_test parts) each include 15%.
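Putting the two splits to work, the sketch below (an illustrative assumption, not from the article) shows the role each set plays: train one model per candidate hyperparameter value, pick the value that scores best on the validation set, and touch the test set only once at the very end. The dataset, model, and candidate values are hypothetical.

```python
# Sketch: using the train/validation/test sets for hyperparameter tuning.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=42)

# 70% train, 15% validation, 15% test via two train_test_split() calls
X_train, X_rem, y_train, y_rem = train_test_split(X, y, train_size=0.7,
                                                  random_state=42)
X_valid, X_test, y_valid, y_test = train_test_split(X_rem, y_rem,
                                                    test_size=0.5,
                                                    random_state=42)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:  # candidate hyperparameter values
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_valid, y_valid)  # validate; never peek at the test set
    if score > best_score:
        best_C, best_score = C, score

# Final model with the chosen hyperparameter, evaluated exactly once
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print(final.score(X_test, y_test))
```

Because the test set never influences the choice of C, its score remains an honest estimate of generalization.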
In most cases, we do not create a separate validation set in this way. This is where cross-validation comes into play.
Instead of creating a separate validation set, we split…