KNNImputer for Filling Missing Data in Data Preprocessing
K-Nearest Neighbors (KNN) Algorithm for Handling Missing Data
K-Nearest Neighbors (hereafter, KNN) is a supervised machine-learning algorithm that classifies an instance by looking at its k nearest (closest) neighbors and assigning it the class that is most common among them.
Neighbors of an instance are found using the Euclidean distance. The Euclidean distance between two data points is calculated using the following formula.
d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
where
x = (x1, x2, …, xn)
y = (y1, y2, …, yn)
Here, n is the dimension of the space. For example, when n = 2, the distance d(x, y) between x and y is calculated in 2-dimensional space. n can also be any higher dimension.
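To make the distance calculation concrete, here is a minimal sketch in plain Python. The function name euclidean_distance and the sample points are made up for illustration; they are not part of any library.

```python
import math


def euclidean_distance(x, y):
    """Return the Euclidean distance between two points of equal dimension."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of dimensions")
    # Sum the squared differences per dimension, then take the square root.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# n = 2: the points (1, 2) and (4, 6) differ by 3 and 4, so d = 5.0
d_2d = euclidean_distance((1, 2), (4, 6))

# n = 3: the same function works in any higher dimension
d_3d = euclidean_distance((0, 0, 0), (1, 2, 2))

print(d_2d, d_3d)
```

The same function works for any n because it only iterates over paired coordinates.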
In KNN, k is a hyperparameter: a value that we must choose ourselves before running the algorithm, because it is not learned from the data. Depending on the value of k, the same instance may be classified into different classes! So, we need to choose k carefully.
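The effect of k can be seen in a small sketch. This is a toy majority-vote classifier written for illustration only, not a production implementation; the function name knn_classify, the training points, and the query point are all hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs.
    """
    # Sort the training points by Euclidean distance to the query point.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    # Keep the labels of the k closest points and take the most common one.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical toy data: one "A" point close to the query, two "B" points
# a little further away, and two more "A" points far away.
train = [
    ((1.0, 1.0), "A"),
    ((3.0, 3.0), "B"),
    ((3.0, 4.0), "B"),
    ((6.0, 6.0), "A"),
    ((7.0, 7.0), "A"),
]
query = (1.5, 1.5)

for k in (1, 3, 5):
    print(k, knn_classify(train, query, k))
```

With this data, the single closest point is labeled "A", the three closest are one "A" and two "B", and all five together contain three "A" points, so the predicted class flips as k grows.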
The intuition behind KNNImputer
The KNNImputer utilizes the KNN algorithm to impute the missing values in the dataset. The replacement values are…
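As a minimal usage sketch, assuming scikit-learn and NumPy are installed: the small matrix X and the choice n_neighbors=2 below are made up for illustration. With its default settings, scikit-learn's KNNImputer measures closeness with a Euclidean distance that skips missing coordinates, and fills each missing entry with the mean of that feature over the nearest neighbors that have a value for it.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data with two missing entries (np.nan); values chosen for illustration.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# n_neighbors plays the role of k: how many neighbors contribute to each fill.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Values that were already present are left unchanged; only the np.nan entries are replaced.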