Imputing Missing Data in Categorical Variables
Discussing KNeighborsClassifier, SimpleImputer and CategoricalImputer Methods
Many real-world datasets contain missing values which are the non-existent values for certain observations. Missing values are often encoded as NaNs or other placeholders.
The problem with having missing values in the dataset is that they are not compatible (supported) with popular machine learning packages such as Scikit-learn. So, it is not possible to use datasets with missing values as the input to build machine learning models with these ML packages.
Thus, we must discard entire rows or columns containing missing values or replace them with supported values. The first option is easy and straightforward, but you will lose much valuable data too. The second option is the best one because you will get the full value of the dataset.
Imputation is the process of replacing the missing values by inferring (using statistics such as mean, median or mode) them from the known part of the data or by performing machine learning calculations (finding the nearest neighbours) or by using an arbitrary number.
There are various methods to impute missing data in categorical variables which usually contain string as values (even though string values…