In this article, we try to explain in simple words what clustering is and how to do it in Python.
What is clustering?
Clustering is a way to group similar data points into a bigger entity and assign a label to this entity. Unlike classification, it does not need labels that are known in advance: the groups are discovered from the data itself.
Here is an example:
Given the sepal and petal lengths of irises, can we tell the different species apart?
As you can see, nothing is apparent on the chart.
The role of clustering is to use a chosen similarity metric (for example, the distance between points) to split these data points into different groups, e.g. species.
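To make the idea of a metric more concrete, here is a minimal sketch that assigns each flower to the closest group center using the Euclidean distance. The measurements, the group centers, and the nearest_group helper are all hypothetical, invented only for illustration; a real clustering algorithm would find the centers by itself.

```python
import math

# Hypothetical measurements: (sepal length, petal length) in cm.
flowers = [(5.0, 1.4), (6.4, 4.5), (6.9, 5.7)]

# Hypothetical group centers, invented for this illustration.
centers = {
    "group A": (5.0, 1.5),
    "group B": (6.0, 4.3),
    "group C": (6.6, 5.6),
}


def nearest_group(point, centers):
    """Return the name of the center closest to point (Euclidean distance)."""
    return min(centers, key=lambda name: math.dist(point, centers[name]))


# Assign every flower to its closest center.
assignments = [nearest_group(flower, centers) for flower in flowers]
print(assignments)
```

The Euclidean distance is only one possible choice here; another metric could group the same points differently.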
We could hypothesize that there are 2, 3, or more species depicted on this plot.
Scientists have already done some clustering based on the characteristics of the different irises available around the world.
We can see that, within a margin of error, we can tell the species of an iris given its sepal and petal length. This is because we already have enough data to create clusters.
How to do clustering?
Today there is a multitude of methods that will help you do clustering.
Scikit-learn provides a great variety of clustering methods.
Beginners could approach the problem with the K-means method, which tries to separate samples into n groups of equal variance.
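As a minimal sketch of what this could look like, the snippet below runs scikit-learn's KMeans on the Iris dataset that ships with the library, keeping only the sepal and petal lengths. Asking for 3 groups is an assumption made for this example, and the n_init and random_state values are just illustrative choices to keep the result reproducible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn.
iris = load_iris()

# Keep only sepal length (column 0) and petal length (column 2), in cm.
X = iris.data[:, [0, 2]]

# Assumption: we ask for 3 groups. n_init and random_state are
# illustrative choices that make the run reproducible.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)

# Fit the model and get one cluster index (0, 1 or 2) per flower.
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster index of the first 10 flowers
print(kmeans.cluster_centers_)  # one (sepal length, petal length) center per cluster
```

Note that the cluster indices are arbitrary: cluster 0 is not necessarily the first species in the dataset. Changing n_clusters to 2 or 4 is an easy way to see how the grouping changes.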
Scikit-learn has a really good article on what happens under the hood and how to use K-means clustering:
There you go! You now have all the resources to do simple clustering in Python!