How to do clustering in Python

In this article, we explain in simple words what clustering is and how to do it in Python.

What is clustering?

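Clustering is a way to group similar data points together and assign a label to each group. Unlike classification, it does not need labeled examples: the groups are discovered from the data itself, and they can then be used to categorize new points.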

Here is an example

Given the sepal and petal lengths of irises, can we tell the different species apart?

A plot of data collected on irises: sepal length vs. petal length.

As you can see, no grouping is obvious on the chart.
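To get similar data to experiment with, one option is the iris dataset bundled with scikit-learn. This is an assumption: the article does not say which dataset its chart uses. The sketch below loads it and extracts the two measurements discussed here.

```python
from sklearn.datasets import load_iris

# Load the iris dataset shipped with scikit-learn (assumed here;
# the article does not name the dataset behind its chart).
iris = load_iris()

# Column 0 is sepal length and column 2 is petal length, both in cm.
sepal_length = iris.data[:, 0]
petal_length = iris.data[:, 2]

print(iris.feature_names[0], "/", iris.feature_names[2])
# Show the first five (sepal length, petal length) pairs.
print(list(zip(sepal_length[:5], petal_length[:5])))
```

Passing these two columns to a plotting library would give a scatter plot of sepal length against petal length.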

The role of clustering is to use a chosen similarity metric (for example, the distance between points) to split these data points into different groups, e.g. species.

An attempt at clustering from a visual standpoint

We could hypothesize that there are 2, 3, or more species depicted on this plot.

Scientists have since grouped irises into species based on the characteristics of the different irises found around the world.

Scientists' clustering of irises

We can see that, within a margin of error, we can tell the species of an iris given its sepal and petal length. This is because we already have enough data to create clusters.

How to do clustering?

Many methods exist today to help you do clustering.

Scikit-learn provides a wide variety of clustering methods.

The list of methods can be found in the Examples section of scikit-learn.org.

A lot of methods, as you can see. Source: scikit-learn.org

Beginners can approach the problem with the K-means method, which tries to separate samples into n groups of equal variance.
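To give an intuition of what K-means does, here is a minimal sketch in plain Python. It is not scikit-learn's implementation: the function name `kmeans`, its parameters, and the sample points are made up for illustration. It alternates between assigning each point to its nearest center and moving each center to the mean of its points.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Group points into k clusters with a basic K-means loop (illustrative)."""
    rng = random.Random(seed)
    # Start from k distinct points picked at random as the initial centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: move each center to the mean of its points.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = tuple(sum(c) / len(members) for c in zip(*members))
                new_centers.append(mean)
            else:
                # An empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # Centers stopped moving: the loop has converged.
        centers = new_centers
    return labels, centers


# Two made-up groups of points, far apart from each other (hypothetical data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, k=2)
print(labels)
print(centers)
```

The result depends on the initial centers, which is one reason library implementations usually run the algorithm several times from different starting points.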

Scikit-learn has a really good article on what happens under the hood and how to use K-means clustering:

Section 2.3, Clustering, of the scikit-learn user guide: how to use K-means clustering with the sklearn.cluster module.
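As a starting point, here is a short sketch of K-means with scikit-learn on the iris measurements. The choice of `n_clusters=3` is an assumption based on the three species in scikit-learn's bundled iris dataset, and `n_init` and `random_state` are set only to make the run repeatable.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Use sepal length (column 0) and petal length (column 2) as the two features.
X = load_iris().data[:, [0, 2]]

# n_clusters=3 is an assumption (three species in this dataset);
# n_init and random_state are fixed only for repeatable results.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(labels[:10])             # cluster index of the first ten flowers
print(model.cluster_centers_)  # one (sepal length, petal length) center per cluster
print(model.inertia_)          # within-cluster sum of squared distances
```

The cluster indices are arbitrary: cluster 0 does not correspond to any particular species, so comparing clusters with the known species requires matching them up first.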

There you go! You now have all the resources to do simple clustering in Python!
