How to store data efficiently in Python
Whatever you do in Python, sooner or later you will need to store data for the long term.
There are many things to consider beforehand.
- What kind of data do you want to save?
- How often do you save this data?
- How often do you need to access it?
In this article, we are going to focus mainly on local data storage, but it is good to know that you could also save it online to allow others to access it.
The formats
The most commonly used formats in the industry are the following:
- text
- csv
- json
- excel
- pickle
- hdf5
- parquet
It is important to note that each of these formats has its own best use case.
TEXT
A text file is just a file containing plain text without any structure.
If you only need to store some text once, this could be the right choice. But keep in mind that it is always easier to work with structured data.
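As a minimal sketch of the plain-text case, the snippet below writes a string to a file and reads it back using only the standard library. The file name `notes.txt` and its location in the system temporary directory are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

# Hypothetical file name, placed in the temp directory to avoid clutter.
text_path = Path(tempfile.gettempdir()) / "notes.txt"

note = "Remember to retrain the model on Monday.\n"

# write_text creates (or overwrites) the file and closes it for us.
text_path.write_text(note, encoding="utf-8")

# read_text returns the whole file as a single string.
restored_note = text_path.read_text(encoding="utf-8")
print(restored_note)
```

Passing `encoding="utf-8"` explicitly keeps the file readable on machines whose default encoding differs.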
CSV
This is probably the one you will encounter the most in Python.
This file type stores tabular data as plain text: one row per line, with values separated by commas. It maps naturally onto a DataFrame, a bit like an Excel spreadsheet.
I wrote an article, How to read/save a CSV file, that shows how to use it with the Pandas library.
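To make the CSV round trip concrete, here is a small sketch that uses only the standard library `csv` module rather than Pandas (the Pandas counterparts are `DataFrame.to_csv` and `pandas.read_csv`). The file name and the sample rows are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical file name and sample data.
csv_path = Path(tempfile.gettempdir()) / "people.csv"
rows = [
    {"name": "Alice", "age": 30},
    {"name": "Bob", "age": 25},
]

# newline="" lets the csv module control line endings itself.
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "age"])
    writer.writeheader()    # first line: name,age
    writer.writerows(rows)  # one line per dictionary

with csv_path.open(newline="", encoding="utf-8") as f:
    loaded_rows = list(csv.DictReader(f))

# CSV carries no type information: every value comes back as a string.
print(loaded_rows)
```

Note that `age` is written as a number but read back as text, so any type conversion has to happen after loading.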
JSON
JSON is one of the most famous formats for sharing data across the web.
It is simple to understand and very good at keeping things tidy.
Here are two articles that will show you how to read and write in JSON format.
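A short sketch of writing and reading JSON with the standard library `json` module follows. The `config` dictionary and the file name are hypothetical; any structure built from dictionaries, lists, strings, numbers, booleans and `None` behaves the same way.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical file name and content.
json_path = Path(tempfile.gettempdir()) / "config.json"
config = {
    "model": "random_forest",
    "params": {"n_estimators": 100, "max_depth": None},
    "features": ["age", "income"],
}

# indent=2 makes the file easy to read and to diff.
with json_path.open("w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)

with json_path.open(encoding="utf-8") as f:
    loaded_config = json.load(f)

# Nested dictionaries and lists survive the round trip; None becomes null and back.
print(loaded_config["params"])
```

JSON only knows a handful of types, so objects such as sets, tuples or dates need to be converted before saving.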
Excel
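Excel is one of the most widely used file formats, relied on by millions of people. However, it is not the preferred choice for web services or any other kind of programmatic data exchange.
It is not lightweight, and it is tied to spreadsheet software: the legacy .xls format is proprietary, while the newer .xlsx format is an open standard.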
You will surely need to read or write such files, so here are two articles that will help you with that.
Pickle
Pickle is Python's built-in format for serializing Python objects.
It is useful for storing Python objects as binary files and reading them back later with their original types intact.
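The sketch below illustrates what keeping an object's type means in practice: a dictionary holding a set and a tuple is pickled and restored unchanged. The object and the file name are invented for the example.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical file name and object.
pickle_path = Path(tempfile.gettempdir()) / "state.pkl"
state = {"seen_ids": {1, 2, 3}, "window": (0.5, 1.5), "label": "run-1"}

# Pickle is a binary format, so the file is opened in "wb" / "rb" mode.
with pickle_path.open("wb") as f:
    pickle.dump(state, f)

with pickle_path.open("rb") as f:
    restored_state = pickle.load(f)

# The set is still a set and the tuple is still a tuple.
print(type(restored_state["seen_ids"]), type(restored_state["window"]))
```

Only load pickle files that come from a source you trust, because unpickling can execute arbitrary code.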
HDF5
HDF5 is one of the formats used for Big Data and is efficient for storing matrices.
It is one of the most efficient ways to store big DataFrames that contain a lot of numbers, which makes it a good fit for Machine Learning datasets.
Parquet
Parquet is another format that is popular for Big Data and for storing huge amounts of data.
It is a columnar format that compresses data efficiently, and it is an open-source Apache project originally built for the Hadoop ecosystem.
Here you are! You now know the most common file formats and when each one is a good fit.
More on DataFrames
If you want to know more about DataFrames and Pandas, check out the other articles I wrote on the topic.