# KNNImputer for Filling Missing Data in Data Preprocessing

## K-Nearest Neighbors (KNN) Algorithm for Handling Missing Data

K-Nearest Neighbors (hereafter, KNN) is a supervised machine-learning algorithm that uses the **k** nearest (closest) neighbors of an instance to classify it into a relevant class.

Neighbors of an instance are found using the Euclidean distance. The Euclidean distance between two data points

**x** = (x1, x2, …, xn)

**y** = (y1, y2, …, yn)

is calculated using the following formula:

**d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)**

Here, **n** is the dimension of the space. For example, when n=2, the distance between **x** and **y**, or **d(x, y)**, is calculated in 2-dimensional space; **n** can be any higher dimension.
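As an illustration of the formula, here is a minimal sketch in Python (the function name `euclidean_distance` is our own, not part of any library):

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two points of the same dimension n."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# 2-dimensional example: a 3-4-5 right triangle
print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```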

In KNN, **k** is a hyperparameter that we need to define before running the algorithm. Depending on the value of **k**, the same instance may be classified into different classes! So, we need to choose the value of **k** carefully.
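To see how the value of **k** can change the outcome, here is a small sketch with Scikit-learn's `KNeighborsClassifier` on a tiny hypothetical 1-D dataset (the data points are made up for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D dataset: class 0 on the left, class 1 on the right
X = [[0], [1], [2], [10]]
y = [0, 0, 1, 1]

# The same query point gets a different class depending on k
for k in (1, 3):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict([[1.6]])[0])  # k=1 -> class 1, k=3 -> class 0
```

With k=1, only the single closest point (2, class 1) votes; with k=3, the points 0 and 1 (both class 0) outvote it.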

## The intuition behind KNNImputer

The KNNImputer utilizes the KNN algorithm to impute the missing values in the dataset. Each replacement value is the uniform or distance-weighted mean of the nearest neighbors, whose count is specified by the **n_neighbors** hyperparameter. When k=3, for example, the algorithm considers 3 neighbors for each data point and calculates the distances (d1, d2 and d3).
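This intuition can be checked on a tiny hypothetical dataset (the values are made up for illustration). With `n_neighbors=2`, the missing entry is replaced by the plain mean of that feature in the 2 nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],   # second feature is missing
              [3.0, 6.0],
              [8.0, 8.0]])

# The 2 nearest rows to [2.0, nan] are [1.0, 2.0] and [3.0, 6.0]
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled[1, 1])  # mean(2.0, 6.0) = 4.0
```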

## Important hyperparameters in KNNImputer

In Python, KNN-based imputation can be performed using Scikit-learn's **KNNImputer()** class. Here are the most important hyperparameters of that class.

```python
from sklearn.impute import KNNImputer
import numpy as np

imputer = KNNImputer(missing_values=np.nan,
                     n_neighbors=5,
                     weights='uniform')
```

**missing_values:** The placeholder for the missing data. The default is `np.nan`. All `np.nan` values will be imputed.

**n_neighbors:** The number of nearest neighbors. This takes an integer. The default is 5.
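To show these hyperparameters in action, here is a sketch on a hypothetical dataset comparing the default `weights='uniform'` with `weights='distance'`, which gives closer neighbors more influence (the data values are made up for illustration):

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([[1.0, 2.0],
              [2.5, np.nan],   # missing value to impute
              [3.0, 6.0],
              [9.0, 8.0]])

uniform = KNNImputer(missing_values=np.nan, n_neighbors=2, weights='uniform')
weighted = KNNImputer(missing_values=np.nan, n_neighbors=2, weights='distance')

# Uniform: plain mean of the 2 nearest neighbors, (2.0 + 6.0) / 2 = 4.0
print(uniform.fit_transform(X)[1, 1])   # 4.0
# Distance-weighted: the closer neighbor (6.0) pulls the result toward it
print(weighted.fit_transform(X)[1, 1])  # 5.0
```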