How to do clustering in Python

In this article, we explain in simple terms what clustering is and how to do it in Python.

What is clustering?

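Clustering is a way to group similar data points into a bigger entity and assign a label to that entity. It is related to classification, but where classification learns from examples that are already labeled, clustering has to find the groups on its own.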

Here is an example:

Given the sepal and petal lengths of a set of irises, can we tell the different species apart?

A plot of data collected on irises: sepal length vs. petal length.

As you can see, no obvious grouping stands out on the chart.

The role of clustering is to choose a metric, such as the distance between points, and use it to sort these data points into different groups, e.g. species.
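To make the idea concrete, here is a minimal sketch of one popular clustering method, k-means, written with only the standard library. It is an illustration rather than a recommended implementation: the measurements are made up, the metric is assumed to be plain Euclidean distance, and the starting centres are simply the first k points, which real libraries choose more carefully.

```python
import math

def kmeans(points, k, iterations=10):
    """Tiny k-means sketch: returns (centroids, labels)."""
    # Assumption: start from the first k points (deterministic but naive).
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return centroids, labels

# Made-up (sepal length, petal length) pairs in cm, for illustration only.
flowers = [(5.0, 1.4), (6.4, 4.5), (4.9, 1.5), (6.6, 4.6), (5.1, 1.3), (6.3, 4.7)]
centroids, labels = kmeans(flowers, k=2)
print(labels)  # → [0, 1, 0, 1, 0, 1]
```

The labels 0 and 1 carry no meaning by themselves. They only say which points ended up in the same group; deciding that a group corresponds to a species is a separate step.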

An attempt at clustering from a visual standpoint

We could hypothesize that there are 2, 3, or more species depicted on this plot.
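One common heuristic for comparing such guesses, not necessarily the one used for the plots here, is to run the clustering for several group counts and see how tightly the points sit around their cluster centres. The sketch below assumes made-up measurements, Euclidean distance, and a naive k-means that starts from the first k points; the helper name `spread_for` is hypothetical.

```python
import math

def spread_for(points, k, iterations=10):
    """Run a naive k-means and return the summed squared distance
    of every point to the centre of its cluster."""
    centroids = [points[i] for i in range(k)]  # assumption: naive start
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return sum(
        math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels)
    )

# Made-up (sepal length, petal length) pairs in cm, for illustration only.
flowers = [
    (5.0, 1.4), (6.4, 4.5), (7.1, 5.9),
    (4.9, 1.5), (6.6, 4.6), (7.3, 6.1),
    (5.1, 1.3), (6.3, 4.7), (7.2, 6.0),
]
spreads = {k: spread_for(flowers, k) for k in (1, 2, 3)}
for k, spread in spreads.items():
    print(k, round(spread, 2))
```

The spread always shrinks as more groups are allowed, so the smallest value is not automatically the best. What people usually look for is the point where adding another group stops helping much.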

Scientists have by now done this kind of clustering, based on the characteristics of the different irises found around the world.

We can see that, within a margin of error, we can tell the species of an iris given its sepal and petal length. This is because we already have enough data to create clusters.
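Once clusters exist, guessing the species of a new flower can be as simple as finding the closest cluster centre. The sketch below assumes three centres with made-up coordinates and placeholder labels ("species A" and so on), and the function name `guess_species` is hypothetical. The distance it returns gives a rough feel for the margin of error: the further a flower sits from its nearest centre, the less the guess should be trusted.

```python
import math

# Hypothetical cluster centres as (sepal length, petal length) in cm.
# Both the coordinates and the labels are made up for illustration.
centroids = {
    "species A": (5.0, 1.5),
    "species B": (5.9, 4.3),
    "species C": (6.6, 5.6),
}

def guess_species(sepal_length, petal_length):
    """Return the label of the closest centre and the distance to it."""
    flower = (sepal_length, petal_length)
    label = min(centroids, key=lambda name: math.dist(flower, centroids[name]))
    return label, math.dist(flower, centroids[label])

label, distance = guess_species(6.0, 4.5)
print(label, round(distance, 2))  # → species B 0.22
```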