K-Means clustering algorithm
k-means clustering is an unsupervised machine learning algorithm. unsupervised learning is when we don't have a target variable in mind and just want to understand the data well and find patterns in it. we don't need a labeled dataset for this.
clustering is a commonly used unsupervised technique for getting an intuition about the structure of the data. you can think of it as the task of grouping the data into multiple subgroups according to similarity, where data points in the same subgroup are very similar to each other and data points in different subgroups are very different from each other. these subgroups are called clusters. how do we measure similarity? we use a distance measure such as Euclidean distance. since this is unsupervised learning, we don't have ground truth to compare our results against; we just want to investigate the data. now that we know what clustering is, let's dive into one of the most popular clustering algorithms, k-means.
the K in K-Means stands for the number of clusters we will divide the dataset into. if K = 3, we divide the whole dataset into three distinct subgroups. we have to predefine K (the number of clusters). it's an iterative algorithm: it calculates the centroids and assigns the data points repeatedly, and it keeps updating until the centroids stop changing position.
in this method, data points are assigned to clusters in such a way that the distance between each data point and its centroid is as small as possible.
How it works:
step 1: first we need to specify the number of clusters the algorithm should generate, i.e. define the value of K.
step 2: select K randomly chosen data points as centroids, without replacement. (a centroid can also be a point other than an existing data point.)
step 3: calculate the squared distance between every data point and every centroid.
step 4: assign every data point to its closest centroid.
step 5: compute the new centroid of each cluster by averaging all the data points assigned to that cluster.
step 6: repeat steps 4 and 5. this continues until no data point changes cluster or the centroids stop moving.
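the steps above can be sketched as a short from-scratch implementation. this is a minimal illustration in NumPy, assuming Euclidean distance and an iteration cap I picked arbitrarily; it is not a production implementation (for real work, scikit-learn's KMeans handles initialization and edge cases better):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 2: pick k existing data points as initial centroids, without replacement
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # steps 3-4: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 5: recompute each centroid as the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # step 6: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

on two well-separated groups of points this will recover the groups; with unlucky initialization k-means can converge to a poor local optimum, which is why libraries usually restart it several times.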
Expectation — Maximization:
k-means follows the E-M approach to solve the problem. simply put, the E step assigns each data point to its closest cluster, and the M step computes the new centroid of each cluster. if you want to know more about this technique please refer to this article.
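to mirror the E/M naming, the two steps can be written as separate routines. this decomposition is my own illustrative sketch in NumPy, not a reference implementation:

```python
import numpy as np

def e_step(X, centroids):
    """E step: assign every point to its nearest centroid."""
    d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    return d.argmin(axis=1)

def m_step(X, labels, k):
    """M step: recompute each centroid as the mean of its assigned points."""
    return np.array([X[labels == i].mean(axis=0) for i in range(k)])
```

running e_step and m_step alternately until the centroids stop moving is exactly the iterative procedure described above.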
Graphical Explanation :
let's visualize the algorithm graphically.
this is our imaginary dataset
step 1: choose a value for K. for our example let's say K = 3.
step 2: select K randomly chosen points to be the first set of centroids.
step 3: measure the distance between all the data points and all the centroids.
step 4: allocate each data point to the cluster of its nearest centroid.
the three clusters here are represented by the red, green and blue colors.
step 5: calculate the new positions of the centroids by averaging the data points in each cluster.
the hollow star-shaped points are the new centroids of the corresponding clusters.
step 6: now reassign the points to their closest centroids.
two points changed cluster. we continue this process until no new assignment happens.
unlike supervised learning algorithms, here we don't have ground truth to evaluate model performance; cluster analysis has no such evaluation metric. the performance of the k-means algorithm depends highly on the number of clusters it forms. choosing the optimal value of K is a big task since the model doesn't learn it from the data. in this article we will discuss two methods that can give us some intuition about the value of K.
1. Elbow method
2. Silhouette analysis
the elbow method is one of the most popular ways to find the optimal number of clusters. it uses the WCSS value (within-cluster sum of squares), which measures the total variation within the clusters.
the formula for WCSS is

WCSS = Σ_{i=1}^{K} Σ_{x_j ∈ C_i} distance(x_j, c_i)²

where i stands for each cluster and j for each point; c_i is the centroid of the i-th cluster C_i.
the distance here can be any distance measure, such as Euclidean distance or Manhattan distance.
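for concreteness, here are both distance measures computed with NumPy on made-up coordinates for one point and one centroid:

```python
import numpy as np

p = np.array([1.0, 4.0])   # a data point (hypothetical coordinates)
c = np.array([4.0, 0.0])   # a centroid (hypothetical coordinates)

# Euclidean distance: straight-line distance, sqrt(3^2 + 4^2) = 5.0
euclidean = np.sqrt(((p - c) ** 2).sum())

# Manhattan distance: sum of absolute coordinate differences, |3| + |4| = 7.0
manhattan = np.abs(p - c).sum()
```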
it works in the following way:
- it runs k-means clustering on the given dataset for different values of K (generally ranging from 1 to 10)
- for each value of K, it calculates the WCSS value
- it plots a curve of WCSS against K
- the sharp point of the bend, where the plot looks like an arm, is considered the optimal value of K.
since the graph shows a sharp bend that looks like an elbow, it is known as the elbow method.
the graph suggests 3 would be a good choice for our cluster analysis. sometimes it is still difficult to find an optimal number of clusters, because the curve decreases monotonically and may not show any clear elbow. if we choose the number of clusters equal to the number of data points in our dataset, the WCSS value becomes zero and the plot touches the x axis.
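as a sketch, the loop below computes WCSS for a range of K on made-up blob data; the toy data, the simple Lloyd-style k-means loop inside, and the range 1 to 6 are all illustrative assumptions (in practice, scikit-learn's KMeans exposes WCSS directly as its inertia_ attribute):

```python
import numpy as np

def wcss_for_k(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else centroids[i] for i in range(k)])
    # WCSS: sum of squared distances from each point to its own centroid
    labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
    return float(((X - centroids[labels]) ** 2).sum())

# three well-separated blobs: the WCSS curve should bend around K = 3
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
curve = [wcss_for_k(X, k) for k in range(1, 7)]
```

plotting curve against range(1, 7) would show the sharp drop flattening out after K = 3, the elbow.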
for this, please refer to this amazing, well-explained article from Analytics Vidhya; it explains the concept very well.
that's the end of this article. please share your reaction to this post in the comment section.