K-Means Clustering with the Elbow method (2024)

K-means clustering is an unsupervised learning algorithm that groups data based on each point's Euclidean distance to a central point called the centroid. Each centroid is defined as the mean of all points in its cluster. The algorithm first chooses random points as centroids and then iteratively adjusts them until convergence.
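The loop described above (assign each point to its nearest centroid, then move each centroid to the mean of its cluster) can be sketched in a few lines of NumPy. This is an illustrative toy implementation, not what Scikit-Learn does internally:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=42):
    """Minimal sketch of Lloyd's algorithm for K-means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Euclidean distance of every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster went empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```

Note that the result depends on the random initialization, which is why libraries typically rerun the algorithm several times and keep the best solution.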

An important thing to remember when using K-means is that the number of clusters is a hyperparameter: it must be defined before running the model.

K-means can be implemented using Scikit-Learn with just 3 lines of code. Scikit-Learn also ships with a centroid initialization method, k-means++, that helps the model converge faster.
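For instance, a minimal Scikit-Learn fit looks like this (the toy matrix X here is a stand-in for real features):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy feature matrix standing in for real measurements
X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [8.2, 7.9]])

# init='k-means++' is Scikit-Learn's default seeding strategy;
# n_init and random_state are pinned for reproducibility
model = KMeans(n_clusters=2, init='k-means++', n_init=10, random_state=42)
labels = model.fit_predict(X)

print(labels)                 # one cluster id per row
print(model.cluster_centers_) # the fitted centroids
```

`fit_predict` both fits the model and returns the cluster label of each row in one call.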

To apply the K-means clustering algorithm, let's load the Palmer Penguins dataset, choose the columns that will be clustered, and use Seaborn to plot a scatter plot with color-coded clusters.


Note: You can download the dataset from this link.

Let's import the libraries and load the Penguins dataset, trimming it to the chosen columns and dropping rows with missing data (there were only 2):

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv('penguins.csv')
print(df.shape) # (344, 9)

df = df[['bill_length_mm', 'flipper_length_mm']]
df = df.dropna(axis=0)

We can use the Elbow method to get an indication of the number of clusters for our data. It consists of interpreting a line plot with an elbow shape: the number of clusters is where the elbow bends. The x-axis of the plot is the number of clusters and the y-axis is the Within-Cluster Sum of Squares (WCSS) for each number of clusters:

wcss = []
for i in range(1, 11):
    clustering = KMeans(n_clusters=i, init='k-means++', random_state=42)
    clustering.fit(df)
    wcss.append(clustering.inertia_)

ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sns.lineplot(x=ks, y=wcss);
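For reference, the `inertia_` attribute used above is exactly the WCSS: the sum of squared distances from each sample to the centroid of its assigned cluster. A quick manual check on a tiny synthetic dataset (a stand-in, not the penguin data):

```python
import numpy as np
from sklearn.cluster import KMeans

# Four points forming two obvious pairs
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# WCSS computed by hand: squared distance of each point to its centroid
wcss_manual = sum(
    np.sum((X[km.labels_ == j] - c) ** 2)
    for j, c in enumerate(km.cluster_centers_)
)
print(wcss_manual, km.inertia_)  # the two values match
```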

The elbow method indicates our data has 2 clusters. Let's plot the data before and after clustering:

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
sns.scatterplot(ax=axes[0], data=df, x='bill_length_mm', y='flipper_length_mm').set_title('Without clustering')
sns.scatterplot(ax=axes[1], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering.labels_).set_title('Using the elbow method');

This example shows that the Elbow method is only a reference when used to choose the number of clusters. We already know that there are 3 species of penguins in the dataset, but if we were to determine their number by using the Elbow method alone, 2 clusters would be our result.
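When true labels are available, as with the species column in the full penguins dataset, the agreement between a clustering and the known classes can be quantified with the adjusted Rand index. A sketch with placeholder label lists (not the actual penguin labels):

```python
from sklearn.metrics import adjusted_rand_score

# Placeholder labels: three true classes, plus a 2-cluster result
# that merges two of them, and a 3-cluster result that matches
# the truth up to a relabeling of the cluster ids
true_species   = [0, 0, 1, 1, 2, 2]
two_clusters   = [0, 0, 1, 1, 1, 1]
three_clusters = [1, 1, 2, 2, 0, 0]

print(adjusted_rand_score(true_species, two_clusters))    # below 1: imperfect match
print(adjusted_rand_score(true_species, three_clusters))  # exactly 1: perfect match
```

The index is invariant to how the cluster ids are numbered, so a perfect partition scores 1.0 regardless of labeling.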

Since K-means is sensitive to data variance, let's look at the descriptive statistics of the columns we are clustering:

df.describe().T # T is to transpose the table and make it easier to read

This results in:

                   count        mean        std    min      25%     50%    75%    max
bill_length_mm     342.0   43.921930   5.459584   32.1   39.225   44.45   48.5   59.6
flipper_length_mm  342.0  200.915205  14.061714  172.0  190.000  197.00  213.0  231.0

Notice that the two columns live on very different scales (mean bill length of about 44 mm versus mean flipper length of about 201 mm) and have different spreads. Since K-means relies on distances, the larger-scale feature will dominate the clustering. Let's reduce this effect by scaling the data with StandardScaler:

from sklearn.preprocessing import StandardScaler

ss = StandardScaler()
scaled = ss.fit_transform(df)
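After this transformation, each column has (approximately) zero mean and unit standard deviation, putting bill length and flipper length on the same footing. A quick check on synthetic data standing in for the two penguin columns:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Stand-in for the two penguin columns, deliberately on different scales
data = np.array([[40.0, 190.0], [45.0, 200.0], [50.0, 210.0]])
scaled_demo = StandardScaler().fit_transform(data)

print(scaled_demo.mean(axis=0))  # ~[0, 0]
print(scaled_demo.std(axis=0))   # ~[1, 1]
```

StandardScaler subtracts each column's mean and divides by its standard deviation, so distances computed afterwards weight both features comparably.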

Now, let's repeat the Elbow method process for the scaled data:

wcss_sc = []
for i in range(1, 11):
    clustering_sc = KMeans(n_clusters=i, init='k-means++', random_state=42)
    clustering_sc.fit(scaled)
    wcss_sc.append(clustering_sc.inertia_)

ks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sns.lineplot(x=ks, y=wcss_sc);


This time, the suggested number of clusters is 3. We can plot the data with the new cluster labels, alongside the two previous plots for comparison:

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))
sns.scatterplot(ax=axes[0], data=df, x='bill_length_mm', y='flipper_length_mm').set_title('Without clustering')
sns.scatterplot(ax=axes[1], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering.labels_).set_title('With the Elbow method')
sns.scatterplot(ax=axes[2], data=df, x='bill_length_mm', y='flipper_length_mm', hue=clustering_sc.labels_).set_title('With the Elbow method and scaled data');

When using K-means clustering, you need to predetermine the number of clusters. As we have seen, the k chosen by a method such as the Elbow is only a suggestion, and it can be affected by the variance in the data. It is important to conduct an in-depth analysis and generate more than one model with different values of k when clustering.

If there is no prior indication of how many clusters are in the data, visualize it, test it, and interpret it to see if the clustering results make sense. If not, cluster again. Also look at more than one metric and instantiate different clustering models: for K-means, look at the silhouette score, and perhaps try Hierarchical Clustering to see whether the results stay the same.
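As a concrete example of a second metric, the silhouette score can be computed for a range of k values and the k with the highest score chosen. A sketch on synthetic blobs standing in for the scaled features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs, a stand-in for real scaled data
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0, 4, 8)])

scores = {}
for k in range(2, 7):  # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

The silhouette score ranges from -1 to 1 and measures how much closer each point is to its own cluster than to the nearest other cluster, so unlike WCSS it does not decrease automatically as k grows.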




Author: Lakeisha Bayer VM
