The Pitch
So, I recently joined a travel startup, and since my first day here I have been focused on organizing the humongous data from different sources that fuels the business, using data pipelines and sharding strategies.
The Problem
Part of the job was solving a huge data-import problem. All we needed was to read a CSV file with 45 million rows (records), transform each row into a model schema, and then load it into a SQL database (read: Postgres). The existing system took an insane amount of time (in the range of tens of hours) for this entire ETL operation. Now, I love solving problems the "distributed way", but this operation is not part of the engineering roadmap, so we didn't want to invest a lot of time designing a sophisticated system that (who knows) may be scrapped in the near future. We wanted a smart, fast, but simple solution that is not resource intensive.
The Approach
It’s not a new problem, and there are possibly a dozen different ways of doing this.
I broke the problem into two parts:
- Keep reading the file asynchronously into a buffer and, simultaneously, keep reading from that buffer to perform the transformation and load
- Find an efficient and fast way of loading the models into the database
I have taken a liking to Golang in recent times and decided to give it a go 😛 If you have not tried this amazing language, Go for it, give it a try. Golang has the concept of a channel and a buffered channel, which is an exact fit for this problem space. We start a goroutine that keeps reading from the file and putting records into the channel, while the consumer reads from the channel to transform and load them.
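Here is a minimal sketch of that first part. The Record type, the CSV layout, and the buffer size of 1000 are placeholders for illustration; the real schema has many more fields:

```go
package main

import (
	"encoding/csv"
	"io"
	"log"
	"os"
)

// Record is a hypothetical model; the real schema has many more fields.
type Record struct {
	Name  string
	Email string
}

// readRecords streams the CSV into a buffered channel so the consumer can
// transform and load rows while the file is still being read.
func readRecords(path string, out chan<- Record) {
	defer close(out)

	f, err := os.Open(path)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	r := csv.NewReader(f)
	for {
		row, err := r.Read()
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		out <- Record{Name: row[0], Email: row[1]}
	}
}

func main() {
	records := make(chan Record, 1000) // the buffered channel is our buffer
	go readRecords("data.csv", records)

	for rec := range records {
		_ = rec // transform and load each record here
	}
}
```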
Now that we had the first part sorted, for the second part, and just for fun (and of course to compare performance), I sampled 1 million records from the file. I used the gorm library as the ORM, which helped me avoid a lot of boilerplate.
Plain and simple approach:
I wanted to see how much time it would take to insert the models one by one. It all started fine, and I kept staring at the screen for the first 5 minutes. Then I went to grab some coffee and run some errands, and by the time I was back, my machine was running at full throttle. The process finished in about 45 minutes.
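For reference, the one-by-one version looks roughly like this, assuming gorm v1 (github.com/jinzhu/gorm) and the Record type from the earlier sketch; every row is its own INSERT, which is why it crawls:

```go
import (
	"github.com/jinzhu/gorm"
	_ "github.com/jinzhu/gorm/dialects/postgres"
)

// insertOneByOne saves each record with its own INSERT via gorm's Create.
// Every row pays the full round-trip and per-statement commit cost.
func insertOneByOne(db *gorm.DB, in <-chan Record) error {
	for rec := range in {
		if err := db.Create(&rec).Error; err != nil {
			return err
		}
	}
	return nil
}
```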
Use the “Copy” command:
Postgres has a COPY command that (quite literally) dumps data into the database, typically with the indexes dropped first. And to be honest, it was quite fast; the magic happened in a few seconds, 15 seconds to be exact. But the downside is that we need to rebuild the indexes afterwards, which takes time depending on the size of the data. In my opinion, COPY is fine if we are performing a one-time data load, but for subsequent insertions, it won't cut it.
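One way to drive COPY from Go is lib/pq's CopyIn helper, sketched below; the table name "records" and the columns are placeholders for the real schema:

```go
import (
	"database/sql"

	"github.com/lib/pq"
)

// copyLoad streams records into Postgres with COPY via lib/pq's CopyIn.
func copyLoad(db *sql.DB, in <-chan Record) error {
	txn, err := db.Begin()
	if err != nil {
		return err
	}
	stmt, err := txn.Prepare(pq.CopyIn("records", "name", "email"))
	if err != nil {
		txn.Rollback()
		return err
	}
	for rec := range in {
		if _, err := stmt.Exec(rec.Name, rec.Email); err != nil {
			txn.Rollback()
			return err
		}
	}
	// A final Exec with no arguments flushes the buffered COPY data.
	if _, err := stmt.Exec(); err != nil {
		txn.Rollback()
		return err
	}
	if err := stmt.Close(); err != nil {
		txn.Rollback()
		return err
	}
	return txn.Commit()
}
```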
Using “Transaction”:
The idea was to wrap a bunch of record inserts into one transaction; we could do this in batches of 1,000 records per transaction. That way, we batch the data and still have some granularity over the record set. The advantage is that we can keep using the indexes without compromising too much on time and without being too resource intensive. The load took around 3-5 minutes to finish.
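A rough sketch of this batched-transaction load, reusing the gorm setup and Record type from the earlier snippets; the batch size of 1,000 is the one mentioned above, but it is worth tuning:

```go
// insertInBatches commits every batchSize inserts as one transaction,
// amortizing commit overhead while keeping the indexes in place.
func insertInBatches(db *gorm.DB, in <-chan Record, batchSize int) error {
	tx := db.Begin()
	count := 0
	for rec := range in {
		if err := tx.Create(&rec).Error; err != nil {
			tx.Rollback()
			return err
		}
		count++
		if count%batchSize == 0 {
			if err := tx.Commit().Error; err != nil {
				return err
			}
			tx = db.Begin()
		}
	}
	// Commit whatever is left in the final, possibly partial batch.
	return tx.Commit().Error
}
```

In practice this would be wired up as insertInBatches(db, records, 1000) after starting the reader goroutine from the first sketch.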