How to filter a DataFrame

2 min

Here is how you can apply a filter on a pandas DataFrame.

We define a sample DataFrame

import pandas as pd

# We read a sample dataset from the web.
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
We read our sample dataset

You can download it here if the above script doesn't work for you and read it like so

import pandas as pd

# We read a sample dataset
df = pd.read_csv('./iris.csv')
We read our sample dataset

One conditional statement mask

A condition within squared brackets

# The mask
mask = df["sepal_length"] > .4

# We apply the mask
print(df[mask])
Defining a conditional statement as a mask

it is also possible to it in one-line

print(df[df["sepal_length"] > .4])
Directly within the squared brackets

N conditional statements

It is possible to add more than one conditional statement like so

# The mask
mask_1 = df["sepal_length"] > .4
mask_2 = df["sepal_width"] > 3.1

# We apply the mask
print(df[mask_1 & mask_2])
Defining two conditional statements as masks 

Or the oneliner :

# We apply the mask
print(df[(df["sepal_length"] > .4) & (df["sepal_width"] > 3.1)])
In one line of code

What can you filter for?

Where text equals

# We apply the mask
print(df[df["species"] == "setosa"])
Find all the setosa occurences

Where text contains

There is a similar method for filtering where the text contains a string.

That might be useful when you are trying to filter out a specific email out of an email list and knows the prefix before the @ sign.

# We apply the mask
print(df[df["species"].str.contains("vir")])
We select only the species containing "vir"

Filter for specific dates

Here is how to filter for dates that start on a specific date and end on another specific date.

# we import the library
import pandas as pd

dates = pd.date_range(start="2021-01-01", end="2022-01-02", freq="D")

# we create the sample dataframe with dates
df = pd.DataFrame({"date": dates,
                   "col1":range(len(dates))})

# We filter for rows that starts on the 2021-06-01 and ends on the 2021-07-01
df[(df["date"] >= "2021-06-01") & (df["date"] <= "2021-07-01")]

Boolean filter

# we import the Pandas library
import pandas as pd

dates = pd.date_range(start="2021-01-01", end="2022-01-02", freq="D")

# we create the sample dataframe with dates
df = pd.DataFrame({"date": dates,
                   "col1":range(len(dates))})

# We filter for rows that starts on the 2021-06-01 and ends on the 2021-07-01
df[(df["date"] >= "2021-06-01") & (df["date"] <= "2021-07-01")]