It is not uncommon to have to deal with, for instance, csv-files containing millions of rows. Searching, filtering and slicing them can therefore be time-consuming tasks. So the question is: are there ways to speed up the process? If so, this could save a considerable amount of time for any data scientist who needs to run queries on a regular basis.
Anyone who has ever worked with SQL must have noticed that it is quite efficient at retrieving specific data from large databases. Well, can this be done in Python? The answer is of course “yes”. In what follows, I’ll give you an example of how to use the pandas and sqlalchemy libraries to simplify searching through large csv-files.
First step: creating a database
There are several ways to create a database in Python, and which approach to take mostly depends on the data you have. The determining factor is without a doubt the size of the csv-file you want to turn into a database. In any case, you should first install the sqlalchemy library using
pip install sqlalchemy
Once this is done, you need to create a database engine and point to the csv-file you want to work with:
import pandas as pd
from sqlalchemy import create_engine
import time

# Path to the csv-file (7,9 million rows and 42 columns)
file = "....\Fact.csv"

# Create an SQLite database engine stored in the file csv_database.db
csv_database = create_engine('sqlite:///csv_database.db', echo=False)

# Read the entire csv-file into a dataframe
df = pd.read_csv(file, sep=";", dtype='unicode')
I added the time library only to be able to measure how long it takes to create the database. Note that I have read the entire csv-file, which is 7,9 million rows long and has 42 columns. The function to_sql allows one to write the dataframe df into the database that has been created:
start = time.time()
# Write the full dataframe to the table FACTS2 in a single operation
df.to_sql("FACTS2", con=csv_database, if_exists='replace')
end = time.time()
print(end - start)
Note that I have passed the connection to the database created above, csv_database. Another important setting of to_sql is the if_exists parameter, which controls what happens when the table being created already exists. In this case, I am not planning on appending other data, so if the table already exists I simply replace it. With the default setting, the call would instead fail if the table is already there.
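For reference, to_sql accepts three values for the if_exists parameter. A minimal sketch of what each mode would do, reusing the dataframe and table name from above:

df.to_sql("FACTS2", con=csv_database, if_exists='fail')     # default: raise an error if FACTS2 already exists
df.to_sql("FACTS2", con=csv_database, if_exists='replace')  # drop FACTS2 and write df from scratch
df.to_sql("FACTS2", con=csv_database, if_exists='append')   # add the rows of df to the existing FACTS2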
The entire process took an unacceptable 66,6 minutes!
As you can see, reading the entire dataframe in one piece is a time-consuming process. For larger files it might not even be possible because of memory issues. Hence, a wiser approach is to partition the data into chunks and append each of them to the table. There are several ways to do this, but I believe the following one might be the quickest.
start = time.time()
chunksize = 100000   # number of rows read per iteration
i = 0                # chunk counter
j = 1                # running offset so that each chunk gets unique row indices
for df in pd.read_csv(file, chunksize=chunksize, iterator=True, sep=";", dtype='unicode'):
    # Remove blanks from column names so they make valid SQL identifiers
    df = df.rename(columns={c: c.replace(' ', '') for c in df.columns})
    df.index += j
    i += 1
    # Append the chunk to the table (here named FACT; the queries below use FACTS2 from the one-shot load)
    df.to_sql('FACT', csv_database, if_exists='append')
    j = df.index[-1] + 1
end = time.time()
print(end - start)
In this case, the process took around 520 seconds, or 8,7 minutes. A clear improvement! Here we have a single table in the database, and with just one table it might not make much sense to create a database at all. However, new tables can be added, and with the appropriate keys between them one can easily use standard SQL queries to subset the data.
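As a sketch of how a second table could be added (the file machines.csv, the table name MACHINES and its contents are hypothetical and only serve as an illustration), the mechanism is exactly the same as above:

machines = pd.read_csv("machines.csv", sep=";", dtype='unicode')    # hypothetical dimension table, one row per machine
machines.to_sql("MACHINES", con=csv_database, if_exists='replace')  # written to the same SQLite database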
Querying the database
To query data from the database you simply need to apply standard SQL. Here is an example:
result = csv_database.execute("SELECT Date, Value1, Value2, Machine FROM FACTS2 where Date = '20190901' and ID = '16201707'")
The result of the query can then be fetched and loaded into a dataframe:
start = time.time()
# Fetch all matching rows and load them into a dataframe
Filtered = pd.DataFrame(result.fetchall())
Filtered.columns = result.keys()
end = time.time()
print(end - start)
Filtered
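A more compact alternative, which also avoids calling execute directly on the engine (a pattern that was removed in SQLAlchemy 2.0), is to let pandas run the query and build the dataframe in one step. A minimal sketch, reusing the query from above:

# pandas sends the query to the engine and returns the result as a dataframe
query = """SELECT Date, Value1, Value2, Machine
           FROM FACTS2
           WHERE Date = '20190901' AND ID = '16201707'"""
Filtered = pd.read_sql_query(query, con=csv_database)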
The equivalent procedure in pandas, applied to a dataframe df holding the full csv-file, would be:
start = time.time()
# Boolean filtering on Date and ID, then selecting the columns of interest
Filtered = df.loc[(df['Date']=='20190901') & (df['ID']=='16201707'), ['Date', 'Value1', 'Value2', 'Machine']]
end = time.time()
print(end - start)
Now, which of the two is faster? Subsetting and filtering using pandas is actually much faster when querying a single table of that size… by a factor of 5. But, and there is a but…
This is no longer true as soon as you add more tables to the database. Doing so and selecting rows conditionally on joins in the database is faster than merging dataframes in pandas and then subsetting the resulting dataframe. For very large csv-files it is actually preferable to create a database with sqlite. Another advantage is that data can be appended to tables in the database without having to read all the already existing data, something you would have to do if you worked only with .loc in pandas.
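To give an idea of what such a query could look like (the MACHINES table and its MachineType column are hypothetical, as in the sketch earlier), the join and the filter can be pushed entirely into the database:

# Join the fact table to the hypothetical MACHINES table and filter in SQL
join_query = """SELECT f.Date, f.Value1, f.Value2, m.MachineType
                FROM FACTS2 AS f
                JOIN MACHINES AS m ON f.Machine = m.Machine
                WHERE f.Date = '20190901'"""
Joined = pd.read_sql_query(join_query, con=csv_database)

The pandas counterpart would first merge the two dataframes in memory and only then subset the result.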
I’ll leave this as an exercise! Enjoy!