Clustering using k-Means with implementation (2024)


Clustering is a technique for finding natural groups in data.

[Figure: a picture of animals of four different types]

If we show the above picture to a kid, they can identify that there are four types of animals. They may not know the names of all of them, but they can still tell that there are four different types, and they can do this on their own, without the help of an adult. Since no supervision is needed, clustering is an unsupervised technique.

Three motivations for clustering can be listed as follows.

Underlying Structure: We can see that the data falls into different groups; within a group there is little variation, while between groups the variation is large.

Natural Classification: We know that there are four different types or groups of animals.

Summarization: If we now think of a strategy for, say, shelter or food, we don't need to plan for every animal individually, because a sheep's food requirement will be very different from a panda's, giraffe's, or rhino's, but not much different from another sheep's. This can be very important when a company is trying to come up with a customer-group-specific strategy.

k-means is arguably the most popular clustering algorithm; it divides the objects into k groups. It has numerous applications, since we often want to find structure in data: grouping students, customers, patients, products, movies, tweets, images, and whatnot.

Let’s now see how it works.

Input: the data, along with the value of 'k'.

[Figure: table of sample data]

Step 1: Initialize 'k' random points from the data as the cluster centers. Let's assume the value of k is 2 and that the 1st and the 4th observations are chosen as the centers.
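The initialization step can be sketched as follows. The data values here are hypothetical stand-ins for the small table shown in the figures, since the original numbers are not reproduced in the text.

```python
import numpy as np

# Hypothetical toy data: 7 observations with 2 features, standing in
# for the small table shown in the article's figures
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])

k = 2
rng = np.random.default_rng(0)
# Step 1: pick k distinct observations at random as the initial centers
centers = X[rng.choice(len(X), size=k, replace=False)]
```

Each initial center is simply one of the existing observations, which is why the distance from a center to the observation it was picked from is 0.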

[Figure: the data with the 1st and 4th observations chosen as centers]

Step 2: For every point, find its distance from each of the k cluster centers; Euclidean distance can be used. The following table is used as an illustration: the C1 and C2 columns indicate each point's distance from the two centers respectively.

[Table: distances of each point from centers C1 and C2]

If, for the 2nd observation, we want to determine the distance from centroid 1, the calculation will look like the following.

[Figure: Euclidean distance calculation for the 2nd observation]
  • Notice that the distance of the first point from center 1 is 0 and the distance of the fourth point from the second center is 0. The reason is simple: these points were selected as the centroids.
  • Also, for the third observation, the distance from both centers is the same.
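The distance calculation itself is a one-liner. The coordinates below are hypothetical, since the actual values from the table are not reproduced in the text.

```python
import numpy as np

# Hypothetical 2-D points: the 2nd observation and centroid 1
point = np.array([1.5, 2.0])
center = np.array([1.0, 1.0])

# Euclidean distance: square root of the sum of squared coordinate differences
dist = np.sqrt(np.sum((point - center) ** 2))
# equivalently: np.linalg.norm(point - center)
```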

Step 3: Every point is assigned to the cluster center it is nearest to, and then each cluster center is recomputed as the arithmetic mean of the points assigned to it.

[Figure: cluster assignments of the points]

The first two points are assigned to cluster 1, the third point (being equidistant from both centers) is assigned arbitrarily, and the last four points are assigned to cluster 2.
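Steps 2 and 3 can be sketched together. The data and centers below are hypothetical; the 3rd point is deliberately equidistant from both centers, and note that NumPy's argmin breaks such ties toward the lower index rather than randomly.

```python
import numpy as np

# Hypothetical toy data; centers are the 1st and 4th observations
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centers = X[[0, 3]]

# Step 2: distance of every point to every center, shape (n_points, k)
d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)

# Step 3a: assign each point to its nearest center
# (argmin resolves the equidistant 3rd point toward index 0)
labels = d.argmin(axis=1)

# Step 3b: recompute each center as the mean of the points assigned to it
new_centers = np.array([X[labels == j].mean(axis=0)
                        for j in range(len(centers))])
```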

[Figure: recomputed cluster centers]

Steps 2 and 3 are repeated until convergence:

  • No change in the cluster centers from iteration to iteration
  • Very small change in the cluster assignments

Essentially, we are iteratively trying to find the k best representative points, and these are computed by taking means. Hence the name k-means.
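Putting the steps together, a minimal k-means sketch might look like this. It keeps the old center if a cluster ever becomes empty, a simplification that real implementations handle more carefully.

```python
import numpy as np

def kmeans(X, k, max_iter=300, seed=0):
    """Minimal k-means sketch: random initialization, then repeat
    assignment (Step 2) and mean update (Step 3) until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Keep the old center if a cluster ends up empty (a simplification)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # convergence: centers stopped moving
            break
        centers = new_centers
    return centers, labels
```

On two well-separated groups of points, this recovers the groups regardless of which observations are picked in Step 1.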

Some of the issues with k-means are:

  • As the first step is random initialization, it has a bearing on the quality of the clusters. (Some of this can be solved by k-means++, which selects the initial points incrementally, making sure that each new point tends to be far away from the already selected ones.)
  • It favors clusters of circular (spherical) shape. (The distance measure can be played with to a certain extent.)
  • It is affected by outliers. (Outliers can be removed before clustering is applied.)
  • The value of k needs to be given by the user. (An approach to finding it is discussed below.)
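For reference, the k-means++ seeding idea can be sketched as follows. The standard formulation picks each new center with probability proportional to its squared distance from the nearest center chosen so far, favoring far-away points rather than deterministically taking the farthest one.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """k-means++ seeding sketch: the first center is uniform at random;
    each subsequent center is drawn with probability proportional to its
    squared distance from the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center
        d2 = np.min(np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :],
                                   axis=2) ** 2, axis=1)
        # Already-chosen points have d2 = 0, so they cannot be re-picked
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```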

WCSS = Σⱼ₌₁ᵏ Σ_{xᵢ ∈ Cⱼ} ‖xᵢ − μⱼ‖²   (μⱼ is the mean of the points in cluster Cⱼ)

A good cluster is one where the points are close to the center. How can we measure this closeness? We can compute the Euclidean distance of all the points in a cluster from the cluster center, and then extend this to all clusters. This idea is captured in the equation above and is called the within-cluster sum of squares (WCSS). Ideally, we would want it to be small. In the extreme case, we can have one cluster for each point, giving a WCSS of 0. It also turns out that as we increase k, WCSS keeps decreasing; technically, it is said to decrease monotonically.
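The quantity can be computed directly; this is a sketch of what scikit-learn exposes as kmeans.inertia_.

```python
import numpy as np

def wcss(X, labels, centers):
    """Within-cluster sum of squares: squared distance of every point to its
    own cluster center, summed over all clusters."""
    return sum(np.sum((X[labels == j] - centers[j]) ** 2)
               for j in range(len(centers)))
```

For example, two points [0, 0] and [2, 0] in a single cluster with center [1, 0] give a WCSS of 1 + 1 = 2, while one cluster per point drives the WCSS to 0.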

We would rather stop where there is the steepest fall, forming a kind of elbow, when WCSS is plotted on the y-axis against the number of clusters on the x-axis.

Standardization: All distance-based algorithms need the data to be standardized so that no single attribute gets undue importance just because its range of values is larger. The code snippet is given below.

# Importing the Standard Scaler
from sklearn.preprocessing import StandardScaler
# Initializing the Standard Scaler and fitting it on the data
scaler = StandardScaler().fit(df)
# Standardizing the data
df_scaled = scaler.transform(df)

Clustering: The code snippet is given below.

# Importing the library
from sklearn.cluster import KMeans
# Initialization
kmeans = KMeans(n_clusters = 3, init = 'random', max_iter = 300, n_init = 10, random_state = 0)
# Applying clustering
y_kmeans = kmeans.fit_predict(df_scaled)

Some important parameters:

  • n_clusters: the number of clusters, k
  • init: 'random' or 'k-means++' (we have already discussed how k-means++ gives a better initialization)
  • n_init: the algorithm is run with this many different random choices of the initial k points, and the best result is returned
  • max_iter: the maximum number of iterations the algorithm will run for
  • tol: this stands for tolerance and is also used as a termination criterion. This is an interesting concept: if there is only a small change in the centers from one iteration to the next, we are approaching convergence. The centers form a matrix, and we can measure the closeness of two matrices by taking the square root of the sum of squared element-wise differences. This is called the Frobenius norm.
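A quick illustration of the Frobenius norm between two hypothetical center matrices:

```python
import numpy as np

# Hypothetical centers from two consecutive iterations (2 clusters, 2 features)
old_centers = np.array([[0.0, 0.0], [5.0, 5.0]])
new_centers = np.array([[0.1, 0.0], [5.0, 4.9]])

# Frobenius norm of the difference: square root of the sum of
# squared element-wise differences ('fro' is NumPy's default for 2-D arrays)
shift = np.linalg.norm(new_centers - old_centers)
# a small shift from one iteration to the next signals convergence
```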

Determining the value of ‘k’

The code snippet is given below.

# Applying k-means for different values of k and storing the WCSS
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'random', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(df_scaled)
    wcss.append(kmeans.inertia_)

For plotting WCSS against the number of clusters:

import matplotlib.pyplot as plt

plt.plot(range(1, 11), wcss)
plt.title('The elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS') # within-cluster sum of squares
plt.show()
[Figure: elbow plot of WCSS against the number of clusters]
  • You can check out our full implementation here.
  • You can also look at a video tutorial which talks more about clustering here.

Conclusion:

k-means is here to stay because of its simplicity and ease of implementation; it easily appears in the top 10 ML algorithms in any survey. It has shortcomings, but there are also ways to work around them and make k-means robust.

