K-Means Clustering Algorithm

Aman S
7 min read · Oct 1, 2022

Table of Contents

K-Means Clustering Intuition
Real-life situations of Clustering
Choosing the Value of K
Elbow method
Code for the K-Means clustering algorithm and the elbow point

Machine learning algorithms fall into three main categories.

Supervised Learning
Unsupervised Learning
Reinforcement Learning

In Supervised Learning, the dataset comes with a class label or target variable.

In Unsupervised Learning, all you have is a set of features; you know nothing about a target variable or class label.

Instead, we try to identify the underlying structure in the data, often by finding clusters that let us make useful predictions.

In this article, I will take you through K-Means Clustering.

K-Means is a very popular clustering algorithm.

Wait, what do you mean by clustering?

The method of identifying similar groups of data in a dataset is called clustering. Cluster analysis is a technique used in Machine Learning that attempts to find clusters of observations within a dataset.

The primary objective of cluster analysis is to find groupings(clusters) in which the observations within each cluster are very similar, but the observations in different clusters are very different.

Some real-life situations of Clustering are:

Biology: It can be used to group different species of plants and animals.
Libraries: It is used to cluster books on the basis of topics and information.
Insurance: It is used to understand customers and their policies, and to identify fraud.
Marketing: It can be used to characterize and discover customer segments for marketing purposes.
Clustering languages
Image Segmentation
...and many more.

We can identify some structure in such data, and one way of looking at it, just by visual examination, is as three clusters.

The above graph has three clusters. K-Means helps you identify these clusters. K in K-Means is a free parameter: you have to tell the algorithm the value of K before you start.

K-Means Clustering Intuition

K-Means clustering is used to identify and infer intrinsic groups within an unlabeled dataset. It is a centroid-based clustering algorithm.

Centroid:
- A centroid is a data point at the center of a cluster. In centroid-based clustering, each cluster is represented by its centroid.
- K-Means is an iterative algorithm that determines similarity based on how close a data point is to the cluster's centroid.
- The K-Means clustering algorithm uses an iterative procedure to deliver a final result.
- The algorithm requires the number of clusters K and the dataset as input.
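To make "centroid" concrete: it is simply the per-feature mean of the points assigned to a cluster. A minimal sketch, with made-up points purely for illustration:

import numpy as np

# Three hypothetical points assigned to one cluster: rows are (x, y) observations
points = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 2.5]])

# The centroid is the per-feature mean of the cluster's points
centroid = points.mean(axis=0)
print(centroid)  # [2.  2.5]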

Suppose we start with an unlabeled scatter plot of points.

Step 1: Start with K centroids by placing them at random positions. Here, K = 2.

Step 2: Compute the distance of every point from each centroid and assign each point to the cluster of its nearest centroid.

These initial clusters may not be optimal. We make them better at each step by adjusting the centroids of the two clusters.

Step 3: Adjust the centroids so that they become the center of gravity of their clusters.

Repeat the same process.

  • Recompute the distance between the points and the centroids and make new clusters.

Step 4: Re-cluster every point based on its distance from the centroids.

Repeat the process until the clusters do not change anymore.

Step 5: Adjust centroids

Step 6: Recompute the clusters and repeat until the data points stop changing clusters.

After this, we are done: the centroids change no more, and we get these two final clusters.
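Putting the steps together, here is a minimal NumPy sketch of this procedure, assuming the data is a 2-D array X of points and that no cluster ever ends up empty; the scikit-learn implementation used later in this article handles initialization and edge cases much more carefully:

import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random points from X as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Steps 3-6: move each centroid to the mean (center of gravity) of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Done when the centroids change no more
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

Calling labels, centroids = kmeans(X, k=2) on data like the above would reproduce the two clusters from the walk-through.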

Choosing the Value of K

How to determine the correct number of clusters (k)?

  • The K-Means algorithm finds the clusters and the data labels for a pre-defined value of K.
  • To determine the number of clusters in the data, we must run the K-Means clustering algorithm for various values of K and compare the results.
  • The performance of the K-Means algorithm depends upon the value of K, so choosing the value of K that gives the best performance is important.
  • The Elbow Method is the most common technique for this.

We took two clusters (K = 2) in the example above. But there may be a different number of clusters too.

If K = 6, for instance, the same data would be split into six clusters.

But our job is to find the best K-value.

Method

  1. Start with a random K-value.
  2. Compute the SSE (Sum of Squared Errors). The SSE is defined as the sum of the squared Euclidean distances of each point to its closest centroid. Since this is a measure of error, the objective of K-Means is to minimize this value.
  3. Compute the SSE for all the clusters formed, then add them up to get the total SSE for that K-value.

Repeat this process of computing the SSE for various K-values; a short sketch of the computation follows below.
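As a sanity check, the total SSE can be computed by hand from the points X plus the labels and centroids of any K-Means fit. A minimal sketch; scikit-learn exposes the same quantity as km.inertia_, which the code at the end of this article uses:

import numpy as np

def sse(X, labels, centroids):
    # Sum of squared Euclidean distances from each point to its assigned centroid
    diffs = X - centroids[labels]
    return float((diffs ** 2).sum())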

Plot the SSE vs K-Values Graph

Observation from Graph

We can observe that the SSE decreases as the value of K increases.

The main thing to look for in the plot is the elbow point: the K-value after which the SSE stops dropping sharply.

The elbow point is a good K-value; here, K = 4.

So the optimal number of clusters can be 4.

Let us work on a dataset to find the clusters and also the elbow point (K-value).

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
%matplotlib inline
sns.set(style='whitegrid')

Load the Data

df = pd.read_csv('/content/income.csv')
df.head()

Plot the Age column against the Income column

plt.scatter(df.Age,df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')

Use the KMeans clustering algorithm and divide the data into clusters accordingly

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
y_predicted

Output:

array([2, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0])

Add the y_predicted values to the dataset

df['cluster']=y_predicted
df.head()

Get the coordinates of the centroids formed by the algorithm.

km.cluster_centers_

Output:

array([[3.29090909e+01, 5.61363636e+04],
       [3.82857143e+01, 1.50000000e+05],
       [3.40000000e+01, 8.05000000e+04]])

Plot the clusters along with their centroids.

df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
plt.show()

Neither the centroids nor the clusters are optimal: Income is on a much larger scale than Age, so it dominates the distance computation. We need to scale the features.

Scale the data

scaler = MinMaxScaler()
scaler.fit(df[['Income($)']])
df['Income($)'] = scaler.transform(df[['Income($)']])
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])
df.head()

Now that we have scaled the data, repeat the above process.

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
df['cluster']=y_predicted
print(km.cluster_centers_)
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.legend()

Here are the final clusters we got.

WAIT, WE DIRECTLY CONSIDERED THE K-VALUE AS 3.

WHY?

Let us look into it.

Remember the Elbow Method?

Code:

sse = []
k_rng = range(1, 10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age', 'Income($)']])
    sse.append(km.inertia_)

Plot K vs SSE

plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng, sse)

We can observe that the elbow point is at K = 3.

So we considered the K-value as 3.

Yay… finally, we are done.

If you find this article useful, give it some claps 😀, and if you find any mistakes, feel free to mention them in the comments.

Check out my other blogs

Thank You for Reading the Article.

Connect with me on Twitter.
