Data Preprocessing for K-Nearest Neighbors (KNN)

Feature scaling and encoding

Rukshan Pramoditha
5 min read · Apr 14, 2024

Data preprocessing is a mandatory task for any ML algorithm, and KNN is no exception!

Previously, I’ve published an article on KNN. There, we discussed six effective methods for choosing the right number of neighbors (the value of k) for the KNN algorithm.

Today, we will discuss two essential data preprocessing methods used for KNN: Feature scaling and Feature encoding.

Here is a part of the dataset that can be used to build a KNN regression model.

import pandas as pd

# Load the diamonds dataset and preview the first five rows
df = pd.read_csv("diamonds.csv")
df.head()

It is important to know that KNN can be used for both regression and classification tasks. Imagine that you're going to build a KNN regression model on the above dataset, taking the price column as the label. We ignore the x, y and z variables for simplicity.
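As a sketch of that setup (using scikit-learn's KNeighborsRegressor and a tiny synthetic stand-in for the diamonds data, since the numbers here are made up for illustration):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# Tiny synthetic stand-in for the diamonds data (numeric columns only;
# categorical columns like cut/color/clarity would need encoding first)
df = pd.DataFrame({
    "carat": [0.23, 0.21, 0.29, 0.31, 0.24],
    "depth": [61.5, 59.8, 62.4, 63.3, 62.8],
    "table": [55.0, 61.0, 58.0, 58.0, 57.0],
    "price": [326, 326, 334, 335, 336],
})

X = df.drop(columns=["price"])  # features
y = df["price"]                 # label

# KNN regression predicts by averaging the prices of the k nearest rows
knn = KNeighborsRegressor(n_neighbors=3).fit(X, y)
pred = knn.predict(X.iloc[[0]])
```

Since the prediction is an average of three of the training prices, it necessarily falls inside the range of the price column.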

Why feature scaling for KNN?

By default, KNN calculations are based on the Euclidean distance function, which is very sensitive to the relative scale of the input features.

If the features are not measured on a similar scale, the features with larger values will dominate the distance. To bring all feature values back into a similar…
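A minimal sketch of this dominance effect (assuming scikit-learn's StandardScaler; the feature values are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two illustrative features on very different scales:
# a carat-like column (~0-3) and a price-like column (~hundreds)
X = np.array([
    [0.3, 350.0],
    [0.4, 900.0],
    [2.0, 360.0],
])

# Unscaled: Euclidean distance is dominated by the large-valued column,
# so row 0 looks far from row 1 even though their carats nearly match
d_raw_01 = np.linalg.norm(X[0] - X[1])  # ~550, driven by the 2nd feature
d_raw_02 = np.linalg.norm(X[0] - X[2])  # ~10, the carat gap barely counts

# After standardization each column has zero mean and unit variance,
# so both features contribute comparably to the distance
Xs = StandardScaler().fit_transform(X)
d_std_01 = np.linalg.norm(Xs[0] - Xs[1])
d_std_02 = np.linalg.norm(Xs[0] - Xs[2])
```

Before scaling, the two distances differ by a factor of about 50; after scaling, they are roughly comparable, so no single feature decides which neighbors are "nearest".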

