Elbow Method for optimal value of k in KMeans - GeeksforGeeks (2024)

Last Updated : 10 May, 2023

Improve

Prerequisites: K-Means Clustering

In this article, we will discuss how to select the best k (Number of clusters) in the k-Means clustering algorithm.

Introduction To Elbow Method

A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. Since we do not have any predefined number of clusters in unsupervised learning. We tend to use some method that can help us decide the best number of clusters. In the case of K-Means clustering, we use Elbow Method for defining the best number of clustering

What Is the Elbow Method in K-Means Clustering

As we know in the k-means clustering algorithm we randomly initialize k clusters and we iteratively adjust these k clusters till these k-centroids riches in an equilibrium state. However, the main thing we do before initializing these clusters is that determine how many clusters we have to use.

For determining K(numbers of clusters) we use Elbow method. Elbow Method is a technique that we use to determine the number of centroids(k) to use in a k-means clustering algorithm. In this method to determine the k-value we continuously iterate for k=1 to k=n (Here n is the hyperparameter that we choose as per our requirement). For every value of k, we calculate the within-cluster sum of squares (WCSS) value.

WCSS - It is defined as the sum of square distances between the centroids andeach points.

Now For determining the best number of clusters(k) we plot a graph of k versus their WCSS value. Surprisingly the graph looks like an elbow (which we will see later). Also, When k=1 the WCSS has the highest value but with increasing k value WCSS value starts to decrease. We choose that value of k from where the graph starts to look like a straight line.

Implementation of the Elbow Method Usking Sklearn in Python

We will see how to implement the elbow method in 4 steps. At first, we will create random dataset points, then we will apply k-means on this dataset and calculate wcss value for k between 1 to 4.

Step 1: Importing the required libraries

Python3

from sklearn.cluster import KMeans

from sklearn import metrics

from scipy.spatial.distance import cdist

import numpy as np

import matplotlib.pyplot as plt

Step 2: Creating and Visualizing the data

We will create a random array and visualize its distribution

Python3

# Creating the data

x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6,\

7, 8, 9, 8, 9, 9, 8, 4, 4, 5, 4])

x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, \

1, 2, 1, 2, 3, 2, 3, 9, 10, 9, 10])

X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)

# Visualizing the data

plt.plot()

plt.xlim([0, 10])

plt.ylim([0, 10])

plt.title('Dataset')

plt.scatter(x1, x2)

plt.show()

Output:

From the above visualization, we can see that the optimal number of clusters should be around 3. But visualizing the data alone cannot always give the right answer. Hence we demonstrate the following steps.

We now define the following:-

Distortion: It is calculated as the average of the squared distances from the cluster centers of the respective clusters to each data point. Typically, the Euclidean distance metric is used.

 Distortion = 1/n * Σ(distance(point, centroid)^2)

Inertia: It is the sum of the squared distances of samples to their closest cluster center.

 Inertia = Σ(distance(point, centroid)^2)

We iterate the values of k from 1 to n and calculate the values of distortions for each value of k and calculate the distortion and inertia for each value of k in the given range.
Step 3: Building the clustering model and calculating the values of the Distortion and Inertia:

Python3

distortions = []

inertias = []

mapping1 = {}

mapping2 = {}

K = range(1, 10)

for k in K:

# Building and fitting the model

kmeanModel = KMeans(n_clusters=k).fit(X)

kmeanModel.fit(X)

distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,

'euclidean'), axis=1)) / X.shape[0])

inertias.append(kmeanModel.inertia_)

mapping1[k] = sum(np.min(cdist(X, kmeanModel.cluster_centers_,

'euclidean'), axis=1)) / X.shape[0]

mapping2[k] = kmeanModel.inertia_

Step 4: Tabulating and Visualizing the Results
a) Using the different values of Distortion:

Python3

for key, val in mapping1.items():

print(f'{key} : {val}')

Output:

1 : 3.6255513311970012 : 2.03182385331125963 : 1.24233033917441524 : 0.83677387083864615 : 0.7369797544248596 : 0.68982548101124227 : 0.60203116217709518 : 0.52345963639828269 : 0.4587221418509788

Next we will plot the graph of k versus WCSS

Python3

plt.plot(K, distortions, 'bx-')

plt.xlabel('Values of K')

plt.ylabel('Distortion')

plt.title('The Elbow Method using Distortion')

plt.show()

Output:

Elbow Method for optimal value of k in KMeans - GeeksforGeeks (2)

graph of k vs distortion in k-means

b) Using the different values of Inertia:

Python3

for key, val in mapping2.items():

print(f'{key} : {val}')

Output:

1 : 312.952380952380962 : 108.071428571428563 : 39.517460317460314 : 17.9785714285714285 : 14.4452380952380966 : 11.4166666666666687 : 9.2666666666666678 : 7.259 : 6.5

Python3

plt.plot(K, inertias, 'bx-')

plt.xlabel('Values of K')

plt.ylabel('Inertia')

plt.title('The Elbow Method using Inertia')

plt.show()

Output:

Elbow Method for optimal value of k in KMeans - GeeksforGeeks (3)

graph of k versus insertion

To determine the optimal number of clusters, we have to select the value of k at the “elbow” ie the point after which the distortion/inertia starts decreasing in a linear fashion. Thus for the given data, we conclude that the optimal number of clusters for the data is 4.

Clustered Data Points For Different k Values

We will plot images of data points clustered for different values of k. For this, we will apply the k-means algorithm on the dataset by iterating on a range of k values.

Python3

import matplotlib.pyplot as plt

# Create a range of values for k

k_range = range(1, 5)

# Initialize an empty list to

# store the inertia values for each k

inertia_values = []

# Fit and plot the data for each k value

for k in k_range:

kmeans = KMeans(n_clusters=k, \

init='k-means++', random_state=42)

y_kmeans = kmeans.fit_predict(X)

inertia_values.append(kmeans.inertia_)

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans)

plt.scatter(kmeans.cluster_centers_[:, 0],\

kmeans.cluster_centers_[:, 1], \

s=100, c='red')

plt.title('K-means clustering (k={})'.format(k))

plt.xlabel('Feature 1')

plt.ylabel('Feature 2')

plt.show()

# Plot the inertia values for each k

plt.plot(k_range, inertia_values, 'bo-')

plt.title('Elbow Method')

plt.xlabel('Number of clusters (k)')

plt.ylabel('Inertia')

plt.show()

Output:

Elbow Method for optimal value of k in KMeans - GeeksforGeeks (4)

distribution data points for 1 centroid

Elbow Method for optimal value of k in KMeans - GeeksforGeeks (5)

distribution data points for 2 centroid

Elbow Method for optimal value of k in KMeans - GeeksforGeeks (6)

distribution data points for 3 centroid

Elbow Method for optimal value of k in KMeans - GeeksforGeeks (7)

distribution data points for 4 centroid



A

AlindGupta

Improve

Next Article

ML | Determine the optimal value of K in K-Means Clustering

Please Login to comment...

Elbow Method for optimal value of k in KMeans - GeeksforGeeks (2024)

FAQs

What is the elbow method to find optimal K? ›

The elbow method is a graphical method for finding the optimal K value in a k-means clustering algorithm. The elbow graph shows the within-cluster-sum-of-square (WCSS) values on the y-axis corresponding to the different values of K (on the x-axis). The optimal K value is the point at which the graph forms an elbow.

How to find the optimal value of k in K-means? ›

This method is a visual technique used to determine the best K value for a k-means clustering algorithm. In this method, a graph known as the elbow graph plots the within-cluster-sum-of-square (WCSS) values against various K values. The optimal K value is identified at the point where the graph bends like an elbow.

Which technique is used to choose the optimal value for K ____? ›

The Elbow Method

This is probably the most well-known method for determining the optimal number of clusters. It is also a bit naive in its approach. Calculate the Within-Cluster-Sum of Squared Errors (WSS) for different values of k, and choose the k for which WSS becomes first starts to diminish.

What does K mean elbow method inertia? ›

K-Means: Inertia

A good model is one with low inertia AND a low number of clusters ( K ). However, this is a tradeoff because as K increases, inertia decreases. To find the optimal K for a dataset, use the Elbow method; find the point where the decrease in inertia begins to slow. K=3 is the “elbow” of this graph.

How to optimize k-means clustering? ›

  1. 1 Choose a smart initialization. One of the main factors that affect the performance of K-means clustering is how the initial cluster centers are chosen. ...
  2. 2 Scale your data. ...
  3. 3 Use a faster algorithm. ...
  4. 4 Evaluate your clusters. ...
  5. 5 Experiment with different parameters.
Mar 9, 2023

How will you choose the optimal k value in the kNN algorithm? ›

So, the following things must be considered while choosing the value of K:
  1. K should be the square root of n (number of data points in the training dataset).
  2. K should be chosen as the odd so that there are no ties. If the square root is even, then add or subtract 1 to it.
Feb 20, 2024

Is K the optimal value? ›

The k represents the Y coordinate of the Vertex. Also known as the Optimal value. The optimal value is the lowest or highest value in the parabola. The axis of symmetry is always written like y= optimal value.

What is optimal k-means clustering in one dimension? ›

Optimal 1-D clustering by dynamic programming

The problem of 1-D k-means clustering is defined as assigning elements of the input 1-D array into k clusters so that the sum of squares of within-cluster distances from each element to its corresponding cluster mean is minimized.

Which of these is a method of finding the best value for k? ›

The optimal K value usually found is the square root of N, where N is the total number of samples. Use an error plot or accuracy plot to find the most favorable K value.

Which method can be used to find k-means clustering? ›

The Elbow method is the best way to find the number of clusters. The elbow method constitutes running K-Means clustering on the dataset. Next, we use within-sum-of-squares as a measure to find the optimum number of clusters that can be formed for a given data set.

What is optimal K for KNN regression? ›

Square Root of N rule: This rule offers a quick and practical way to determine an initial k value for your KNN model, especially when no other domain-specific knowledge or optimization techniques are readily available. The rule suggests setting k to the square root of N.

What is the optimal K elbow method? ›

The Elbow Method is a visual approach used to determine the ideal 'K' (number of clusters) in K-means clustering. It operates by calculating the Within-Cluster Sum of Squares (WCSS), which is the total of the squared distances between data points and their cluster center.

Why do we use elbow method in clustering? ›

The elbow method is used to determine the optimal number of clusters in k-means clustering. The elbow method plots the value of the cost function produced by different values of k.

How do you find the optimal number of clusters in K-means? ›

The optimal number of clusters can be defined as follow: Compute clustering algorithm (e.g., k-means clustering) for different values of k. For instance, by varying k from 1 to 10 clusters. For each k, calculate the total within-cluster sum of square (wss).

What is the optimal elbow angle? ›

Elbows. Your elbows should be relaxed and held close to the side of the body, creating an approximately 90-degree angle between the arm and the forearm.

What is the elbow method and silhouette method? ›

Elbow and Silhouette method is one method to determine the validity of a clustering method using internal index. Internal index is a way to determine the validity of the clusters without external information. Two of the main data used to determine internal index is cluster cohesion and cluster separation.

How do you choose optimal K in KNN? ›

To get the optimal value of K, you can segregate the training and validation from the initial dataset. Now plot the validation error curve to get the optimal value of K. This value of K should be used for all predictions.

How to use elbow method in R? ›

The steps to perform the elbow method are:
  1. Select a range of k values, usually from 1 to 10 or the square root of the number of observations in the dataset.
  2. Run k-means clustering for each k value and calculate the SSE (sum of squared distances of each point from its closest centroid).
  3. Plot the SSE for each k value.
Mar 2, 2023

References

Top Articles
Latest Posts
Article information

Author: Terrell Hackett

Last Updated:

Views: 6409

Rating: 4.1 / 5 (72 voted)

Reviews: 95% of readers found this page helpful

Author information

Name: Terrell Hackett

Birthday: 1992-03-17

Address: Suite 453 459 Gibson Squares, East Adriane, AK 71925-5692

Phone: +21811810803470

Job: Chief Representative

Hobby: Board games, Rock climbing, Ghost hunting, Origami, Kabaddi, Mushroom hunting, Gaming

Introduction: My name is Terrell Hackett, I am a gleaming, brainy, courageous, helpful, healthy, cooperative, graceful person who loves writing and wants to share my knowledge and understanding with you.