How to filter a Pandas DataFrame using Python


A DataFrame is one core element of the Pandas library. It is widely used in Data Science.

Filtering comes in handy when you want to compute statistics on a subset of the data, for example.

But what can you filter for?

We define a sample DataFrame

import pandas as pd

# We read a sample dataset from the web.
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
We read our sample dataset

Where text equals

# We apply the mask
print(df[df["species"] == "setosa"])
Find all the setosa occurrences
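To see what the mask actually is, here is a minimal sketch. It assumes a small made-up DataFrame (a hypothetical stand-in for the iris data, so it runs without a network connection). The `isin` call at the end is an addition for matching several values at once.

```python
import pandas as pd

# Hypothetical stand-in for the iris data, so the sketch runs offline.
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "setosa", "virginica"],
    "sepal_length": [5.1, 7.0, 4.9, 6.3],
})

# The comparison returns a boolean Series, one True/False per row.
mask = df["species"] == "setosa"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
setosa = df[mask]
print(setosa)

# isin() matches any value from a list.
setosa_or_virginica = df[df["species"].isin(["setosa", "virginica"])]
print(setosa_or_virginica)
```

Storing the mask in a variable makes it easy to reuse or to combine with other conditions later.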

Where text contains

There is a similar method for filtering where the text contains a string.

That is useful, for example, when you are looking for a specific address in an email list and only know the part before the @ sign.

# We apply the mask
print(df[df["species"].str.contains("vir")])
We select only the rows whose species contains "vir"
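The email case mentioned above could look like the following sketch. The `emails` DataFrame and its addresses are made up for illustration. The `case`, `regex` and `na` arguments are optional extras of `str.contains`, used here to handle upper case, special characters and missing values.

```python
import pandas as pd

# Hypothetical email list; None stands for a missing value.
emails = pd.DataFrame({
    "email": [
        "john.doe@example.com",
        "jane@example.org",
        None,
        "JOHN.SMITH@example.com",
    ],
})

# regex=False treats the pattern as plain text, case=False ignores case,
# and na=False counts missing values as "no match".
contains_john = emails[
    emails["email"].str.contains("john", case=False, regex=False, na=False)
]
print(contains_john)

# str.startswith() only matches at the beginning of the string.
starts_with_jane = emails[emails["email"].str.startswith("jane", na=False)]
print(starts_with_jane)
```

Without `na=False`, a missing value typically ends up as a missing entry in the mask, and indexing with such a mask can raise an error.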

Where the number is greater than

# We apply the mask
print(df[df["sepal_length"] > .4])
We select only the rows where sepal_length is greater than .4

Where the number is less than or equal to

# We apply the mask
print(df[df["sepal_length"] <= .4])
We select only the rows where sepal_length is less than or equal to .4
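The two numeric filters split the rows into two complementary groups. Here is a minimal sketch with made-up values and a hypothetical `threshold` chosen inside the range of the data, so both groups are non-empty.

```python
import pandas as pd

# Made-up values for illustration.
df = pd.DataFrame({"sepal_length": [4.3, 5.0, 5.8, 7.9]})
threshold = 5.0  # hypothetical cut-off

# Rows strictly above the threshold.
above = df[df["sepal_length"] > threshold]
print(above)

# Rows at or below the threshold.
at_or_below = df[df["sepal_length"] <= threshold]
print(at_or_below)

# ~ negates a mask: "not greater than" selects the same rows here.
also_at_or_below = df[~(df["sepal_length"] > threshold)]
print(also_at_or_below)
```

Note that the negated mask only matches `<=` when the column has no missing values, because a comparison with NaN is always False.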

Filter for specific dates

Here is how to keep only the rows whose date falls between a start date and an end date (both included).

# we import the library
import pandas as pd

dates = pd.date_range(start="2021-01-01", end="2022-01-02", freq="D")

# we create the sample dataframe with dates
df = pd.DataFrame({"date": dates,
                   "col1":range(len(dates))})

# We keep the rows dated from 2021-06-01 to 2021-07-01 (both included)
print(df[(df["date"] >= "2021-06-01") & (df["date"] <= "2021-07-01")])
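The same date range can be written in other ways. The sketch below rebuilds the sample DataFrame and shows two alternatives that are not part of the original example: `Series.between` and a label slice on a date index.

```python
import pandas as pd

dates = pd.date_range(start="2021-01-01", end="2022-01-02", freq="D")
df = pd.DataFrame({"date": dates, "col1": range(len(dates))})

# between() includes both end points by default.
june = df[df["date"].between("2021-06-01", "2021-07-01")]

# With the dates as a sorted index, .loc accepts a label slice,
# which also includes both end points.
june_by_index = df.set_index("date").loc["2021-06-01":"2021-07-01"]

print(len(june), len(june_by_index))
```

Both variants keep the same rows as the two comparisons joined with `&`; which one reads best is mostly a matter of taste.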

Boolean filter

# we import the Pandas library
import pandas as pd

dates = pd.date_range(start="2021-01-01", end="2022-01-02", freq="D")

# we create the sample dataframe with dates
df = pd.DataFrame({"date": dates,
                   "col1":range(len(dates))})

# Each comparison returns a boolean Series; & combines them,
# so we keep the rows dated from 2021-06-01 to 2021-07-01 (both included)
print(df[(df["date"] >= "2021-06-01") & (df["date"] <= "2021-07-01")])
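A boolean filter can also mean filtering on a column that already holds True/False values. Here is a minimal sketch, assuming a made-up `users` DataFrame with hypothetical `is_active` and `is_admin` columns.

```python
import pandas as pd

# Hypothetical data with two boolean columns.
users = pd.DataFrame({
    "name": ["Ana", "Ben", "Cleo", "Dan"],
    "is_active": [True, False, True, False],
    "is_admin": [False, False, True, True],
})

# A boolean column can be used directly as the mask.
active = users[users["is_active"]]

# ~ means "not", | means "or", & means "and".
inactive = users[~users["is_active"]]
active_or_admin = users[users["is_active"] | users["is_admin"]]
active_admins = users[users["is_active"] & users["is_admin"]]

print(active_admins)
```

When a condition is a comparison rather than a plain column, wrap it in parentheses before combining it with `&` or `|`, as in the date example.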