KNNImputer for Filling Missing Data in Data Preprocessing

K-Nearest Neighbors (KNN) Algorithm for Handling Missing Data

Rukshan Pramoditha


K-Nearest Neighbors (hereafter, KNN) is a supervised machine-learning algorithm that classifies an instance into a class based on the classes of its k nearest (closest) neighbors.

Neighbors of an instance are found using the Euclidean distance. The Euclidean distance between two data points

x = (x1, x2, …, xn)
y = (y1, y2, …, yn)

is calculated using the following formula:

d(x, y) = √[(x1 − y1)² + (x2 − y2)² + … + (xn − yn)²]

Here, n is the dimension of the space. For example, when n=2, the distance d(x, y) between x and y is calculated in 2-dimensional space; n can be any higher dimension.
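The formula above can be sketched in a few lines of NumPy (the helper name euclidean_distance is just an illustration, not part of any library):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two points of the same dimension n."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    # Square the coordinate-wise differences, sum them, take the square root
    return np.sqrt(np.sum((x - y) ** 2))

# 2-dimensional example (n=2): the classic 3-4-5 right triangle
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```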

In KNN, k is a hyperparameter that we need to set before running the algorithm. Depending on the value of k, the same instance may be classified into different classes! So, we need to choose the value of k carefully.
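A tiny example makes this concrete. With the toy 1-dimensional data below (chosen purely for illustration), the same query point receives different labels for k=1 and k=3:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# One point of class 0 and two points of class 1
X = np.array([[0.0], [2.0], [3.0]])
y = np.array([0, 1, 1])
query = [[0.9]]  # closest single neighbor is the class-0 point

for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k} ->", clf.predict(query))
# k=1 -> [0]  (nearest neighbor wins)
# k=3 -> [1]  (majority of the 3 neighbors wins)
```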

The intuition behind KNNImputer

(Image by author)

The KNNImputer uses the KNN algorithm to impute missing values in a dataset. Each replacement value is the uniform or weighted mean of the nearest neighbors, where the number of neighbors is specified by the n_neighbors hyperparameter. When k=3, for example, the algorithm finds the 3 nearest neighbors of each data point with a missing value by calculating the distances (d1, d2 and d3).

Important hyperparameters in KNNImputer

In Python, KNN-based imputation can be performed using the Scikit-learn’s KNNImputer() class. Here are the most important hyperparameters of that class.

from sklearn.impute import KNNImputer
import numpy as np

imputer = KNNImputer(missing_values=np.nan, n_neighbors=5)

  • missing_values: The placeholder for the missing data. The default is np.nan. All np.nan values will be imputed.
  • n_neighbors: The number of nearest neighbors. This takes an integer. The default is 5.
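Putting the pieces together, here is a minimal end-to-end sketch on a made-up 4×2 feature matrix (the data and the n_neighbors=2 choice are illustrative only):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy feature matrix with one missing value (np.nan)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])

# Replace each np.nan with the mean of its 2 nearest neighbors,
# where distances are computed over the non-missing coordinates
imputer = KNNImputer(missing_values=np.nan, n_neighbors=2)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
# The nan in row 3 becomes the mean of column 1 of its two
# nearest rows, [3.0, 4.0] and [8.0, 8.0]: (3 + 8) / 2 = 5.5
```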


