How to do clustering in Python

In this article, we explain in simple terms what clustering is and how to do it in Python.

What is clustering?

Clustering is a way to group similar data points together and assign a label to each group. Unlike classification, it does not need labeled examples: the groups are discovered from the data itself.

Here is an example

Given the sepal and petal lengths of irises, can we tell the different species apart?

A plot of data collected on irises: sepal length vs. petal length.
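To follow along, the two measurements can be pulled out of a ready-made dataset. The sketch below assumes the data is the classic Iris dataset bundled with scikit-learn (the article does not say where its data comes from), and the variable names are only illustrative.

```python
from sklearn.datasets import load_iris

# Assumption: we use the Iris dataset shipped with scikit-learn.
iris = load_iris()

# Columns of iris.data follow iris.feature_names:
# sepal length, sepal width, petal length, petal width (all in cm).
sepal_length = iris.data[:, 0]
petal_length = iris.data[:, 2]

print(iris.feature_names[0], "/", iris.feature_names[2])
print("number of flowers:", len(sepal_length))
```

Plotting `sepal_length` against `petal_length` with any plotting library gives a chart like the one described here.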

As you can see, no grouping is obvious on the chart.

The role of clustering is to use a measure of similarity, such as the distance between points, to sort these data points into different groups, e.g. species.

An attempt at clustering from a visual standpoint

We could hypothesize that there are 2, 3, or more species depicted on this plot.

Scientists have since clustered irises based on the characteristics of the different varieties found around the world.

We can see that, within a margin of error, we can tell the species of an iris from its sepal and petal length. This is because we already have enough data to create clusters.

How to do clustering?

Today there is a multitude of methods for clustering.

Scikit-learn provides a wide variety of clustering methods.

A lot of methods, as you can see. Source: scikit-learn.org

Beginners can approach the problem with the K-means method, which tries to separate samples into n groups of equal variance.
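Here is a minimal sketch of K-means on the iris measurements, assuming the Iris dataset bundled with scikit-learn and a guess of 3 groups. The choices of `n_clusters=3`, `n_init=10`, and `random_state=0` are illustrative assumptions, not values taken from the article.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: the Iris dataset shipped with scikit-learn.
iris = load_iris()
X = iris.data[:, [0, 2]]  # keep sepal length and petal length only

# Assumption: we guess 3 groups; random_state makes the run repeatable.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # one cluster index per flower

print("cluster centers (sepal length, petal length):")
print(kmeans.cluster_centers_)

# Count how often each (true species, cluster) pair occurs.
# Cluster indices are arbitrary, so only the grouping matters.
pairs = Counter(zip(iris.target.tolist(), labels.tolist()))
for (species, cluster), count in sorted(pairs.items()):
    print(iris.target_names[species], "-> cluster", cluster, ":", count)
```

The species labels are used only at the end, to see how well the discovered clusters line up with the known species. K-means itself never sees them.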

The scikit-learn documentation has a really good article on what happens under the hood and how to use K-means clustering.
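To give a rough idea of what "under the hood" means, here is a simplified sketch of the classic K-means loop (often called Lloyd's algorithm) in plain Python. It is not scikit-learn's implementation, and the function name `simple_kmeans`, the helper `squared_distance`, and the toy points are all made up for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def simple_kmeans(points, k, n_iter=100, seed=0):
    """Hypothetical helper: a bare-bones K-means on a list of tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # start from k distinct data points
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:  # nothing moved, so we have converged
            break
        centers = new_centers
    return labels, centers


# Toy data (made up): two small groups of 2D points.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
              (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
toy_labels, toy_centers = simple_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centers)
```

The two alternating steps, assigning points to the nearest center and then recomputing the centers, are the core of the method. Library implementations add smarter initialization and faster arithmetic on top.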

There you go! You now have all the resources to do simple clustering in Python!
