How to filter a DataFrame
Here is how you can apply a filter on a pandas DataFrame.
First, we define a sample DataFrame:
import pandas as pd
# We read a sample dataset from the web.
df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
If the script above doesn't work for you, download the dataset and read it from a local file like so:
import pandas as pd
# We read a sample dataset
df = pd.read_csv('./iris.csv')
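If neither the URL nor a local copy is available, the examples can still be followed on a small hand-built frame. This is only a sketch: the six rows are illustrative values typed by hand, assumed to mimic the column layout of the iris CSV, and they are not the full dataset.

```python
import pandas as pd

# A tiny stand-in with the same column names as the iris CSV.
# The values are illustrative only, not the full dataset.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Check the column names and dtypes before writing any filter.
print(df.dtypes)
print(df.head())
```

Looking at the dtypes first helps, because numeric comparisons only make sense on numeric columns and text filters only on string columns.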
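One conditional statement mask
A single condition placed inside square brackets keeps only the rows where the condition is true.
# The mask: a Boolean Series, True where sepal_length is above 0.4
# Note: every row of the iris dataset passes this threshold; raise it (e.g. 5.0) to see rows dropped
mask = df["sepal_length"] > .4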
# We apply the mask
print(df[mask])
It is also possible to do it in one line:
print(df[df["sepal_length"] > .4])
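To see what the mask actually is, it can help to inspect it before applying it. The sketch below uses a small hand-built frame (an assumption, so that it runs without the CSV) and a threshold of 5.0 chosen only so that one row is dropped.

```python
import pandas as pd

# Hypothetical four-row frame, used only for illustration.
df = pd.DataFrame({"sepal_length": [5.1, 4.9, 7.0, 6.4]})

# The comparison returns a Boolean Series aligned with df's index.
mask = df["sepal_length"] > 5.0
print(mask.tolist())  # [True, False, True, True]

# Summing a Boolean Series counts the True values, i.e. the rows kept.
n_kept = int(mask.sum())
print(n_kept)  # 3

# df[mask] and df.loc[mask] select the same rows.
filtered = df.loc[mask]
print(filtered)
```

Counting the True values first is a cheap way to check that a filter is not accidentally keeping everything or nothing.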
N conditional statements
You can combine more than one condition with the & (and) operator, like so:
# The mask
mask_1 = df["sepal_length"] > .4
mask_2 = df["sepal_width"] > 3.1
# We apply both masks: a row is kept only if it satisfies both conditions
print(df[mask_1 & mask_2])
Or as a one-liner (each condition needs its own parentheses):
# We apply both conditions at once
print(df[(df["sepal_length"] > .4) & (df["sepal_width"] > 3.1)])
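Besides & (and), masks can also be combined with | (or) and negated with ~ (not). The sketch below runs on a small hand-built frame with assumed values. Python's own and / or keywords raise a ValueError on a Series, and the parentheses in a one-liner are needed because & binds more tightly than >.

```python
import pandas as pd

# Hypothetical four-row frame, used only for illustration.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
    "sepal_width": [3.5, 3.0, 3.2, 2.9],
})

long_sepal = df["sepal_length"] > 5.0  # True, False, True, True
wide_sepal = df["sepal_width"] > 3.1   # True, False, True, False

both = df[long_sepal & wide_sepal]     # rows where both conditions hold
either = df[long_sepal | wide_sepal]   # rows where at least one holds
not_long = df[~long_sepal]             # rows where the first does NOT hold

print(both.index.tolist())      # [0, 2]
print(either.index.tolist())    # [0, 2, 3]
print(not_long.index.tolist())  # [1]
```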
What can you filter for?
Where text equals
# We apply the mask
print(df[df["species"] == "setosa"])
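The same pattern extends to "not equal" and to matching any value from a list. A minimal sketch on a hand-built species column (assumed values), using != and Series.isin:

```python
import pandas as pd

# Hypothetical frame, used only for illustration.
df = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica", "setosa"],
})

# Rows where the text is NOT equal to a value.
not_setosa = df[df["species"] != "setosa"]

# Rows where the text equals any value from a list.
two_species = df[df["species"].isin(["setosa", "virginica"])]

print(not_setosa.index.tolist())   # [1, 2]
print(two_species.index.tolist())  # [0, 2, 3]
```

Using isin tends to read better than chaining several == comparisons with |.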
Where text contains
There is a similar method for filtering rows where the text contains a given substring.
That can be useful when you are trying to pull a specific address out of an email list and only know the prefix before the @ sign.
# We apply the mask
print(df[df["species"].str.contains("vir")])
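The email case mentioned above can be sketched as follows. The address list is made up for illustration. The na=False argument treats missing values as non-matches, case=False ignores letter case, and regex=False matches the pattern literally instead of as a regular expression.

```python
import pandas as pd

# Hypothetical email list, including one missing value.
emails = pd.DataFrame({
    "email": ["anna@example.com", "bob@example.org",
              None, "Annabel@example.com"],
})

# Case-sensitive prefix match; missing values count as False.
starts = emails[emails["email"].str.startswith("anna", na=False)]

# Case-insensitive substring match.
contains_ci = emails[emails["email"].str.contains("anna", case=False, na=False)]

# Literal match: with regex=False the dot is a plain character.
org_only = emails[emails["email"].str.contains(".org", regex=False, na=False)]

print(starts.index.tolist())       # [0]
print(contains_ci.index.tolist())  # [0, 3]
print(org_only.index.tolist())     # [1]
```

Without na=False, a missing value in the column can leave a non-Boolean entry in the mask, which pandas refuses to index with.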
Filter for specific dates
Here is how to filter for dates that start on a specific date and end on another specific date.
# we import the library
import pandas as pd
dates = pd.date_range(start="2021-01-01", end="2022-01-02", freq="D")
# we create the sample dataframe with dates
df = pd.DataFrame({"date": dates,
"col1":range(len(dates))})
# We keep the rows dated from 2021-06-01 through 2021-07-01 (both dates included)
print(df[(df["date"] >= "2021-06-01") & (df["date"] <= "2021-07-01")])
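Two variations on the date filter, as a sketch under stated assumptions: Series.between, which includes both endpoints by default, and pd.to_datetime for a column that arrives as text (typical after reading a CSV). The raw frame and its three dates are made up for illustration.

```python
import pandas as pd

dates = pd.date_range(start="2021-01-01", end="2021-12-31", freq="D")
df = pd.DataFrame({"date": dates, "col1": range(len(dates))})

# between() keeps both endpoints by default.
in_range = df[df["date"].between("2021-06-01", "2021-07-01")]
print(len(in_range))  # 31 (30 days of June plus July 1)

# Hypothetical frame whose dates are still plain strings.
raw = pd.DataFrame({"date": ["2021-05-31", "2021-06-15", "2021-07-02"]})

# Convert to datetime first so comparisons are chronological.
raw["date"] = pd.to_datetime(raw["date"])
june = raw[(raw["date"] >= "2021-06-01") & (raw["date"] <= "2021-07-01")]
print(june.index.tolist())  # [1]
```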
Boolean filter
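A column that already holds True / False values can be used directly as the mask. The following is a minimal sketch, assuming a hypothetical is_even column created only for illustration:

```python
import pandas as pd

# Hypothetical frame with a Boolean column derived from col1.
df = pd.DataFrame({"col1": range(6)})
df["is_even"] = df["col1"] % 2 == 0

# The Boolean column is itself a valid mask.
evens = df[df["is_even"]]

# ~ flips the mask to select the rows where the column is False.
odds = df[~df["is_even"]]

print(evens["col1"].tolist())  # [0, 2, 4]
print(odds["col1"].tolist())   # [1, 3, 5]
```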