Approximate Nearest Neighbors (ANN) vs K-Nearest Neighbors (KNN) for Large Datasets
Sacrificing some quality in favour of speed
K-Nearest Neighbors (KNN) is a popular supervised learning algorithm in Machine Learning (ML). It can be used for both classification and regression tasks, although it is most commonly applied to classification.
The major challenge in KNN is choosing the right value of k, which is the number of neighbouring instances considered when classifying a new instance. Technically, k is a hyperparameter in KNN, and the user needs to define its value.
Let’s assume k=5. The algorithm considers the 5 data points nearest to the new instance, and the new instance is assigned to the majority class among those neighbours. Say there are three classes (red, green, blue) in the dataset and 3 of the 5 nearest neighbours are red points; the new instance will then be assigned to the red class (Source: Choosing the Right Number of Neighbors for the K-Nearest Neighbors Algorithm).
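As a minimal sketch of this majority-vote behaviour, here is KNN with k=5 using scikit-learn. The toy 2-D points and the red/green/blue labels are illustrative assumptions, not data from this article:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 2-D training points with three classes
X_train = np.array([[1, 2], [2, 1], [2, 3], [3, 2], [6, 5],
                    [7, 7], [6, 6], [10, 2], [11, 3], [10, 4]])
y_train = np.array(["red", "red", "red", "red", "green",
                    "green", "green", "blue", "blue", "blue"])

# k=5: a new instance is assigned to the majority class
# among its 5 nearest neighbours
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

new_instance = np.array([[2, 2]])
print(knn.predict(new_instance))  # ['red'] — 4 of the 5 neighbours are red
```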
The KNN algorithm uses the Euclidean distance formula to calculate the distances between the new instance and all other existing data points in the training set.
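For reference, the Euclidean distance between two points a and b is the square root of the sum of squared coordinate differences. A small sketch with NumPy (the example points are made up for illustration):

```python
import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """d(a, b) = sqrt(sum_i (a_i - b_i)^2) for points of equal dimension."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical example: distance between a new instance and one training point
new_instance = np.array([2.0, 2.0])
existing_point = np.array([6.0, 5.0])
print(euclidean_distance(new_instance, existing_point))  # 5.0
```

In a brute-force KNN, this computation is repeated against every point in the training set, which is exactly what makes exact KNN expensive on large datasets.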