How to Cross-Validate a Logistic Regression Model Trained on Class-Imbalanced Data

To mitigate overfitting and improve generalization capability

Rukshan Pramoditha

--

A good machine learning model should generalize to new unseen data.

When a model fits the training data too closely, it tends to overfit and fails to generalize to new, unseen data.

One way to avoid this problem is to cross-validate our models.

Cross-validation refers to splitting the training set (or sometimes, the entire dataset) into multiple folds (subsets), using one fold as the validation set and the remaining folds as the training set. The validation fold changes at each iteration, and the evaluation scores from all iterations are averaged to get a more robust evaluation score for the model. — by the author
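As a minimal sketch of this idea (assuming a generic scikit-learn estimator called model and NumPy arrays X and y; the helper name manual_cv_score is just illustrative, not part of scikit-learn):

# Rotate the validation fold and average the per-fold scores
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def manual_cv_score(model, X, y, n_splits=5):
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, val_idx in kf.split(X):
        # Fit a fresh copy of the model on the training folds
        fold_model = clone(model)
        fold_model.fit(X[train_idx], y[train_idx])
        # Score it on the held-out validation fold
        scores.append(fold_model.score(X[val_idx], y[val_idx]))
    # Average the per-fold scores for a more robust estimate
    return np.mean(scores)

In practice, scikit-learn's cross_val_score performs this fold rotation for you and returns the per-fold scores, which you can then average.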

There are many cross-validation (CV) techniques, but here we use only two of them.

k-fold CV is the most popular one. It gives consistent results with low variance, but it does not perform well on class-imbalanced datasets. To address this problem, we have to use stratified k-fold CV. Both will be discussed in detail in the following sections.
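To see why stratification matters, here is a small sketch (using hypothetical labels, not the breast cancer data used later) that prints the class counts in each validation fold. With plain KFold the counts drift from fold to fold, while StratifiedKFold keeps the class ratio roughly constant:

import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.zeros((100, 1))  # dummy features; only the labels matter here

splitters = [("KFold", KFold(n_splits=5, shuffle=True, random_state=42)),
             ("StratifiedKFold", StratifiedKFold(n_splits=5, shuffle=True, random_state=42))]

for name, splitter in splitters:
    print(name)
    for _, val_idx in splitter.split(X_demo, y_demo):
        # Class counts (class 0, class 1) in each validation fold
        print(np.bincount(y_demo[val_idx], minlength=2))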

On the other hand, logistic regression is a simple machine-learning algorithm that can be used for binary classification.

Training the model without cross-validation

# Getting data
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

# Splitting data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.80,
                                                    random_state=42,
                                                    shuffle=True)

# Building the model
from sklearn.linear_model import LogisticRegression

lgr = LogisticRegression()
lgr.fit(X_train, y_train) # Training

#…
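The snippet above is truncated. As a rough sketch of the cross-validated alternative, stratified k-fold CV on the same model could look like the following (it reuses X_train and y_train from above; the choice of 5 splits and the raised max_iter are assumptions for illustration, not taken from the original article):

# Cross-validating the same model with stratified folds (a sketch)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# max_iter is raised only to avoid convergence warnings on unscaled features
scores = cross_val_score(LogisticRegression(max_iter=10000),
                         X_train, y_train,
                         cv=skf, scoring='accuracy')

print("Per-fold accuracy:", scores)
print("Mean CV accuracy:", scores.mean())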
