The Effect of Dimensionality Reduction in k-Means Clustering

Performing k-Means clustering on the PCA-transformed data vs original data

Rukshan Pramoditha
5 min read · Jul 18, 2023

Both PCA and k-Means are unsupervised machine learning techniques. Both work with unlabeled data. How about combining these two techniques?

Sounds interesting?

PCA has many use cases. Most commonly, it is used to reduce the number of input features in a dataset.

k-Means is a clustering algorithm that groups similar instances into clusters.
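As a quick illustration of what each technique does on its own terms, here is a minimal sketch using scikit-learn. The synthetic data from make_blobs, the variable names (X_demo, X_reduced, labels) and the parameter values are assumptions chosen for illustration only; they are not part of the Wine example used later.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic data (assumption for illustration): 300 instances, 10 features, 3 groups
X_demo, _ = make_blobs(n_samples=300, n_features=10, centers=3, random_state=42)

# PCA: project the 10 input features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_demo)

# k-Means: assign every instance to one of 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_reduced)

print("Reduced shape:", X_reduced.shape)
print("Cluster labels found:", np.unique(labels))
```

PCA changes the number of columns (features), while k-Means attaches one cluster label to each row (instance), which is why the two steps can be chained.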

Will k-Means perform better on PCA-transformed data than on the original data? Today, we will run k-Means on both the PCA-transformed data and the original data to find out.

Import and preprocess data

We will use the Wine dataset to perform PCA and k-Means clustering. The dataset has 178 instances and 13 features.

# Loading the wine dataset
from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data
print("Shape:", X.shape)

We need to apply feature scaling to the data. PCA looks for the directions of maximum variance, so features measured on larger scales would dominate the components. k-Means relies on Euclidean distances, which are dominated by large-scale features in the same way.

# Feature scaling
from…
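To see why scaling matters for this dataset, here is a minimal sketch that compares PCA on the raw and on the standardized Wine data. It assumes scikit-learn's StandardScaler as the scaler; the import in the snippet above is not shown, so that choice, along with the names X_scaled, ratio_raw and ratio_scaled, is an assumption made for illustration.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_wine().data

# Assumption: StandardScaler (zero mean, unit variance for every feature)
X_scaled = StandardScaler().fit_transform(X)

# Share of the total variance captured by the first principal component
ratio_raw = PCA(n_components=1).fit(X).explained_variance_ratio_[0]
ratio_scaled = PCA(n_components=1).fit(X_scaled).explained_variance_ratio_[0]

# Spread of the feature scales before standardization
raw_std = X.std(axis=0)
print("Largest / smallest feature std (raw):", raw_std.max() / raw_std.min())
print("First-component variance share (raw):   ", ratio_raw)
print("First-component variance share (scaled):", ratio_scaled)
```

On the raw data, the first component should capture almost all of the variance, because the feature with the largest scale dominates. After standardization, the variance is spread across several components.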
