Working with large CSV files in pandas? Create a SQL database by reading files in chunks

It is not uncommon to have to deal with, for instance, CSV files containing millions of rows. Searching, filtering and slicing them can be time-consuming tasks. So the question is: are there ways to speed up the process? If so, this could save a considerable amount of time for any data scientist who needs to run queries on a regular basis.

Anyone who has ever worked with SQL will have noticed that it is quite efficient at retrieving specific data from large databases. Can this be done in Python? The answer is of course “yes”. In what follows, I’ll give you an example of how to use the pandas and sqlalchemy libraries to simplify searching through large CSV files.

First step: creating a database

There are several ways to create a database in Python, and which one to choose mostly depends on the data you have. The determining factor is without a doubt the size of the CSV file you want to turn into a database. In any case, you should first install the sqlalchemy library using

pip install sqlalchemy

Once this is done, you need to create a database engine and point to the file you want to work with:

import pandas as pd
from sqlalchemy import create_engine
import time

# Path to the CSV file (truncated here); a raw string avoids escape issues in Windows paths
file = r"....\Fact.csv"

# Create a SQLite engine; the database is stored in the file csv_database.db
csv_database = create_engine('sqlite:///csv_database.db', echo=False)

# Read the entire CSV file into a dataframe
df = pd.read_csv(file, sep=";", dtype='unicode')

I added the time library only to measure how long it takes to create the database. Note that the CSV file I have read is 7.9 million rows long and has 42 columns. The to_sql function writes the dataframe df into the database that has just been created:

start = time.time()

# Write the full dataframe to the table FACTS2, replacing it if it already exists
df.to_sql("FACTS2", if_exists='replace', con=csv_database)

end = time.time()
print(end - start)

Note that the connection passed to con is the engine csv_database created above. Another important setting in to_sql is the if_exists parameter, which controls what happens when the table being created already exists. In this case, I am not planning on appending other data, so if the table already exists I simply replace it. With the default setting ('fail'), to_sql would raise an error because the table is already there.

The entire process took an unacceptable 66.6 minutes!

As you can see, reading and writing the entire dataframe in one piece is a time-consuming process. For larger files it might not even be possible because of memory issues. A wiser approach is therefore to partition the data into chunks and append each of them to the table. There are several ways to do this, but I believe the following might be the quickest.

start = time.time()
chunksize = 100000
i = 0  # number of chunks written
j = 1  # index offset so that chunk indices do not overlap

for df in pd.read_csv(file, chunksize=chunksize, iterator=True, sep=";", dtype='unicode'):
    # Strip spaces from the column names
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    # Append the current chunk to the FACT table
    df.to_sql('FACT', csv_database, if_exists='append')
    j = df.index[-1] + 1

end = time.time()
print(end - start)

In this case, the process took around 520 seconds, or 8.7 minutes. A clear improvement! Here we have a single table in the database, so it might not make much sense to create a database at all. However, new tables can be added and, with appropriate keys between them, standard SQL queries can easily be used to subset the data, as sketched below.
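As a rough sketch of that last point, a second table could be loaded into the same database and linked to the fact table through a common column. The file Machines.csv, the MACHINES table and the Location column below are made up for illustration; the Machine column follows the query used later in the post, but treat the names as assumptions.

# Hypothetical dimension table; file name and columns are made up for illustration
machines = pd.read_csv("Machines.csv", sep=";", dtype='unicode')
machines.to_sql("MACHINES", con=csv_database, if_exists='replace')

# A standard SQL join between the two tables (table and column names as assumed above)
join_query = """
    SELECT f.Date, f.Value1, m.Location
    FROM FACT AS f
    JOIN MACHINES AS m ON f.Machine = m.Machine
    WHERE f.Date = '20190901'
"""

Such a query can then be run against the engine exactly as shown in the next section.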

Querying the database

To query data from the database you simply need to apply standard SQL. Here is an example:

result = csv_database.execute("SELECT Date, Value1, Value2, Machine FROM FACTS2 where Date = '20190901' and ID = '16201707'")

The result of the query can then be fetched into a dataframe, and we can time how long that takes:

start = time.time()
# Fetch the result set into a dataframe and label the columns
Filtered = pd.DataFrame(result.fetchall())
Filtered.columns = result.keys()
end = time.time()
print(end - start)
Filtered
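Note that Engine.execute as used above belongs to SQLAlchemy 1.x; in SQLAlchemy 2.0 it is no longer available. An alternative that also sidesteps the fetchall and column-naming steps is pandas’ read_sql_query, which returns a ready-made dataframe:

# Same query via pandas; column names are set automatically
Filtered = pd.read_sql_query(
    "SELECT Date, Value1, Value2, Machine FROM FACTS2 "
    "WHERE Date = '20190901' AND ID = '16201707'",
    con=csv_database,
)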

The equivalent filtering on the in-memory pandas dataframe df would be:

start = time.time()
Filtered = df.loc[(df['Date']=='20190901') & (df['ID']== '16201707'),['Date', 'Value1', 'Value2', 'Machine']]
end = time.time()
print(end - start)

Now, which of the two is faster? Subsetting and filtering with pandas is actually much faster when querying a single table of that size, by a factor of about 5. But, and there is a but…

This is no longer true as soon as you add more tables to the database. Selecting rows conditionally on joins in the database is then faster than merging dataframes in pandas and subsetting the resulting dataframe. For very large CSV files it is therefore preferable to create a database with SQLite. Another advantage is that new data can be appended to the tables in the database without having to read all the already existing data, something you would have to do when working only with .loc in pandas.
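To make the appending point concrete, here is a minimal sketch, assuming a new extract arrives in a file called new_facts.csv (a made-up name) with the same columns: only the new rows are read into memory, while everything already stored in FACT stays on disk.

# Hypothetical new extract with the same columns as the FACT table
new_rows = pd.read_csv("new_facts.csv", sep=";", dtype='unicode')

# Append only the new rows; the existing contents of FACT are not re-read
new_rows.to_sql('FACT', csv_database, if_exists='append')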

I’ll leave this as an exercise! Enjoy!
