KNNImputer for Filling Missing Data in Data Preprocessing

K-Nearest Neighbors (KNN) Algorithm for Handling Missing Data

Rukshan Pramoditha
4 min readAug 29, 2023

The K-Nearest Neighbors (hereafter, KNN) is a supervised machine-learning algorithm that uses a k number of nearest (closest) neighbors to classify an instance into a relevant class.

Neighbors of an instance are found using the Euclidean distance. The Euclidean distance between two data points is calculated using the following formula.

x = (x1, x2, …, xn)
y = (y1, y2,…, yn)

n is the dimension of the space. For example, when n=2, the distance between x and y or d(x, y) is calculated on the 2-dimensional space. n can be any higher dimension.

In KNN, k is a hyperparameter that we need to define during the execution of the algorithm. Depending on the value of k, the same instance may be classified into different classes! So, we need to properly define a value for k.

The intuition behind KNNImputer

(Image by author)

The KNNImputer utilizes the KNN algorithm to impute the missing values in the dataset. The replacement values are…

--

--

Rukshan Pramoditha

3,000,000+ Views | BSc in Stats | Top 50 Data Science, AI/ML Technical Writer on Medium | Data Science Masterclass: https://datasciencemasterclass.substack.com