How to do clustering in Python


In this article, we explain in simple terms what clustering is and how to do it in Python.

What is clustering?

Clustering is a way to group similar data points into larger entities (clusters) and to assign a label to each of them. It is often used as a first step towards classification when the data comes without labels.

Here is an example:

Given the sepal and petal lengths of irises, can we tell the different species apart?

A plot of data collected on irises: sepal length vs. petal length.

As you can see, no grouping is obvious from the chart alone.
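
To get the same kind of data in Python, here is a minimal sketch. It assumes the Iris dataset bundled with scikit-learn is a fair stand-in for the data plotted above; the article does not say where its data comes from.

```python
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris dataset stands in for the article's data.
iris = load_iris()

# Columns are: sepal length, sepal width, petal length, petal width (in cm).
# Keep only sepal length (column 0) and petal length (column 2).
X = iris.data[:, [0, 2]]

print(X.shape)  # (150, 2): 150 flowers, 2 measurements each
print(X[:3])    # first three flowers
```

Each row of `X` is one flower, which is the shape most scikit-learn clustering classes expect.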

The role of clustering is to pick a metric (for example, the distance between points) and use it to sort these data points into different groups, e.g. species.
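
As a rough illustration of what "a metric" can mean, the sketch below uses the Euclidean distance. The three flowers and the two group centers are made-up values chosen by hand, not real measurements.

```python
import numpy as np

# Three hypothetical flowers: (sepal length, petal length) in cm.
points = np.array([[5.0, 1.5], [6.5, 4.5], [5.2, 1.4]])

# Two hypothetical group centers, picked by hand for illustration.
centers = np.array([[5.0, 1.5], [6.5, 4.5]])

# Euclidean distance from every point to every center (shape: 3 x 2).
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Each point joins the group whose center is closest.
groups = distances.argmin(axis=1)
print(groups)  # [0 1 0]
```

The first and third flowers end up in the same group because they are close to each other under this metric.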

An attempt at clustering from a visual standpoint

We could hypothesize that there are 2, 3, or more species depicted on this plot.

Since then, scientists have clustered irises based on the characteristics of the different irises found around the world.
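
One possible way to compare such guesses is to run a clustering method for each candidate number of groups and score the result. The sketch below uses K-means (introduced later in this article) and the silhouette score, which is a common heuristic and not something this article prescribes. It also assumes scikit-learn's bundled Iris dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

# Assumption: sepal length (column 0) and petal length (column 2)
# from scikit-learn's bundled Iris dataset.
X = load_iris().data[:, [0, 2]]

scores = {}
for k in (2, 3, 4):
    # Cluster the flowers into k groups.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Silhouette score is in [-1, 1]; higher means better separated groups.
    scores[k] = silhouette_score(X, labels)

for k, score in scores.items():
    print(f"k={k}: silhouette={score:.2f}")
```

A higher score only says the groups are well separated geometrically; it does not prove that the groups match real species.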

Scientists' clustering of irises

We can see that, within a margin of error, we can tell the species of an iris given its sepal and petal length. This is because we already have enough data to create clusters.
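
To get a feel for that margin of error, one can compare computed clusters against the known species. This is only a sketch: it assumes scikit-learn's bundled Iris dataset and uses K-means (covered later in this article) as the clustering method.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.metrics.cluster import contingency_matrix

iris = load_iris()
X = iris.data[:, [0, 2]]  # sepal length, petal length
species = iris.target     # known species, encoded as 0, 1, 2

# Cluster without looking at the species labels.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Rows are true species, columns are clusters; each cell counts flowers.
table = contingency_matrix(species, clusters)
print(table)

# Agreement between clusters and species: 1.0 is a perfect match,
# values near 0 mean the grouping is no better than random.
print(adjusted_rand_score(species, clusters))
```

Cluster numbers are arbitrary, so a species may land in any column; what matters is whether each row is concentrated in a single column.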

How to do clustering?

Many methods exist today to help you do clustering.

Scikit-learn brings together a wide variety of clustering methods.
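
As a small taste of that variety, the sketch below runs three of scikit-learn's clustering classes on the same data through the shared `fit_predict` method. The dataset (scikit-learn's bundled Iris data) and the parameter values are assumptions picked for illustration, not tuned recommendations.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans
from sklearn.datasets import load_iris

# Assumption: sepal length and petal length from the bundled Iris dataset.
X = load_iris().data[:, [0, 2]]

# Illustrative parameter values, not tuned.
models = {
    "KMeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "AgglomerativeClustering": AgglomerativeClustering(n_clusters=3),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    # DBSCAN labels noise points as -1; do not count that as a cluster.
    results[name] = len(set(labels.tolist()) - {-1})
    print(f"{name}: {results[name]} clusters")
```

K-means and agglomerative clustering are told how many groups to find, while DBSCAN decides the number of groups itself from the density of the points.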

The link to the list of methods: Examples (scikit-learn.org)
A lot of methods, as you can see. Source: scikit-learn.org

Beginners could approach the problem with the K-means method, which tries to separate samples into n groups of equal variance.

Scikit-learn has a really good article on what happens under the hood and how to use K-means clustering:
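
Here is a minimal K-means sketch with scikit-learn's `KMeans` class. It assumes the bundled Iris dataset and 3 groups; the new flower at the end is a made-up measurement.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: sepal length and petal length from the bundled Iris dataset.
X = load_iris().data[:, [0, 2]]

# Ask for 3 groups; random_state makes the run reproducible.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_)  # one (sepal length, petal length) center per group
print(np.bincount(labels))      # number of flowers in each group

# Assign a new, hypothetical flower to the nearest group.
new_flower = np.array([[5.0, 1.5]])
print(kmeans.predict(new_flower))
```

The group numbers themselves carry no meaning; only which flowers share a group does.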

2.3. Clustering (scikit-learn user guide): how to use K-means clustering
Clustering of unlabeled data can be performed with the module sklearn.cluster.

There you go! You now have all the resources you need to do simple clustering in Python!