Imputing Missing Data in Categorical Variables

Discussing KNeighborsClassifier, SimpleImputer and CategoricalImputer Methods

Rukshan Pramoditha
5 min read · Nov 3, 2024

Many real-world datasets contain missing values, that is, values that were never recorded for certain observations. Missing values are often encoded as NaNs or as other placeholders.
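As a quick illustration of the placeholder point, here is a minimal sketch that converts a placeholder to NaN and counts the missing entries per column. The toy DataFrame, its column names and the "?" placeholder are assumptions made up for this example, not taken from a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" stands in for a dataset-specific placeholder.
df = pd.DataFrame({
    "color": ["red", "?", "blue", None],
    "size": ["S", "M", "?", "L"],
})

# Convert the placeholder to NaN so pandas treats it as missing.
df = df.replace("?", np.nan)

# Count missing values per column.
missing_counts = df.isna().sum()
print(missing_counts)
```

Once every placeholder is mapped to NaN, the rest of the pipeline only has to deal with one kind of missing marker.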

The problem with missing values is that most estimators in popular machine learning packages such as Scikit-learn do not accept them. So, a dataset with missing values usually cannot be used directly as the input for building machine learning models with these packages.

Thus, we must either discard the rows or columns that contain missing values or replace those values with supported ones. The first option is easy and straightforward, but you can lose a lot of valuable data. The second option is usually the better one because it keeps every observation available for modelling.
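To make the cost of the first option concrete, here is a minimal sketch using pandas `dropna()`. The toy DataFrame and its column names are hypothetical and only serve to show how much data disappears.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two categorical columns.
df = pd.DataFrame({
    "color": ["red", np.nan, "blue", "red"],
    "size": ["S", "M", np.nan, "L"],
    "price": [10.0, 12.5, 9.0, 11.0],
})

# Option 1a: drop every row that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Option 1b: drop every column that contains at least one missing value.
cols_dropped = df.dropna(axis=1)

print(df.shape, rows_dropped.shape, cols_dropped.shape)
```

In this small example, dropping rows removes half of the observations, and dropping columns leaves only the price column.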

Imputation is the process of replacing missing values with substitutes. These can be inferred from the known part of the data using statistics such as the mean, median or mode, estimated with machine learning calculations such as finding the nearest neighbours, or set to an arbitrary value.
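For a categorical column, the statistical and arbitrary-value approaches can be sketched with Scikit-learn's `SimpleImputer`. This is a minimal example under stated assumptions: the toy column and the fill value "missing" are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame({"color": ["red", np.nan, "blue", "red", np.nan]}, dtype=object)

# Statistical imputation: fill with the mode (most frequent category).
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
mode_filled = mode_imputer.fit_transform(df[["color"]])

# Arbitrary-value imputation: fill with a constant label of our choosing.
const_imputer = SimpleImputer(missing_values=np.nan, strategy="constant",
                              fill_value="missing")
const_filled = const_imputer.fit_transform(df[["color"]])

print(mode_filled.ravel())
print(const_filled.ravel())
```

The mode strategy keeps the set of categories unchanged, while the constant strategy turns "missing" into a category of its own, which can be useful when the absence of a value is itself informative.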

There are various methods to impute missing data in categorical variables, which usually contain strings as values (even though string values…

