Improving your ingestion performance using the multidisciplinary approach

When it comes to solving a problem, developers tend to reach for the technologies they’re comfortable with, the ones they work with on a daily basis.

Most of the time this is fine — we work within a specific scope in which this technology makes sense and has already proven relevant and efficient.

But have you ever run into a case where your solution works, yet is too slow, consumes too many resources, or is simply unsatisfying?

This article aims to introduce you to the multidisciplinary approach, which consists in making different disciplines work together to reach a goal.

That’s a long word but tell me more

Although you might focus on one or a few specific technologies, in a software company you’re likely to work with other people tackling other subjects with different tools.

The first step will be to study the disciplines at hand to see what each of them can achieve and whether it’s worth investing in them or not.

Once you’ve established the list of technologies you see fit, you need to bridge them so they work together in an actual solution.

Technical case study: importing massive files

A recurring mission in our catalog team is the regular import of huge files.

Such tasks are pretty much routine, but the scripts we create can sometimes take a lot of time to run, ranging from one to over a dozen hours. With an ever-growing quantity of data to ingest and a limited pool of back-end machines, this started to become an issue.

We recently had to rework the way music lyrics are imported from scratch, which was the perfect opportunity to try something different.

The problem

We receive lyrics in the form of gzip-compressed files of about 10 million lines, each line being a JSON string. Every day.

Most of the content in these files is already in our databases, so on a day-to-day basis the ratio of new lines, updated lines, and deleted lines is rather small. Yet the solutions that were implemented in the past are not always efficient enough to tackle this problem in an acceptable time frame.

The legacy way

Back when these files didn’t exceed 1 million lines, the most naive solution was adopted: a loop to insert or update each line along with a reference to the file, after which we would delete all entries carrying a different reference.
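
In rough Python-flavoured pseudocode (the actual back-end is PHP, and the `lyrics` table and its columns below are made up for illustration), that legacy import looked something like this:

```python
import gzip
import json

def legacy_import(conn, path, file_reference):
    """Naive import: one query per line, then a cleanup DELETE.

    `conn` is assumed to be a MySQL DB-API connection (e.g. PyMySQL);
    the `lyrics` table and its columns are purely illustrative.
    """
    cur = conn.cursor()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            row = json.loads(line)
            # Insert or update every single line, tagging it with the current file reference.
            cur.execute(
                "INSERT INTO lyrics (track_id, content, file_ref) VALUES (%s, %s, %s) "
                "ON DUPLICATE KEY UPDATE content = VALUES(content), file_ref = VALUES(file_ref)",
                (row["foo"], json.dumps(row["bar"]), file_reference),
            )
    # Anything still carrying an old reference was absent from today's file: delete it.
    cur.execute("DELETE FROM lyrics WHERE file_ref <> %s", (file_reference,))
    conn.commit()
```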

There are at least two problems with this approach:

  • First, we had to process every single line from the file to hydrate an entity with it.
  • Second, we’d update every single line in the database (as long as it’s in the file) just to change the reference. Even with transactions, that’s a lot of useless queries, and it’s downright slow.

It’s also worth mentioning that the reference column needs to be indexed for the delete query to run efficiently.

Gotta go fast

A couple years back we decided to improve our ingestion process, and implemented another concept.

The SQL way

The idea here was to use what we called buffer tables (basically a copy of our regular table, minus the indexes and other additions) as a way to compute differences (additions, updates, deletions) with our real tables.

The first step consists in formatting every line in the file and inserting it into the buffer table (if no formatting was necessary and our input file was a CSV, we could import it straight into MySQL, which would be sweet).
Insertions and updates are then handled by a single INSERT ... ON DUPLICATE KEY UPDATE query, and the deletion is handled by a single UPDATE ... LEFT JOIN query where the right-hand table’s columns are NULL.

Auto generated query for insertion and update

Auto generated query for deletion

We tend to use soft delete by disabling content rather than actually removing the line
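
The screenshots of those auto-generated queries are not reproduced here, so here is a rough sketch of what the two statements could look like, with made-up `lyrics` / `lyrics_buffer` tables and columns, and driven from Python for the sake of the example (the real back-end is PHP):

```python
# Hypothetical sketch: `lyrics` / `lyrics_buffer` and their columns are made up.
UPSERT_FROM_BUFFER = """
    INSERT INTO lyrics (track_id, content)
    SELECT b.track_id, b.content
    FROM lyrics_buffer AS b
    ON DUPLICATE KEY UPDATE content = VALUES(content)
"""

# Soft delete: rows present in `lyrics` but absent from the buffer get disabled.
SOFT_DELETE_MISSING = """
    UPDATE lyrics AS l
    LEFT JOIN lyrics_buffer AS b ON b.track_id = l.track_id
    SET l.enabled = 0
    WHERE b.track_id IS NULL
"""

def apply_buffer_diff(conn):
    """Apply the two-statement diff through a MySQL DB-API connection (e.g. PyMySQL)."""
    cur = conn.cursor()
    cur.execute(UPSERT_FROM_BUFFER)
    cur.execute(SOFT_DELETE_MISSING)
    conn.commit()
```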

There are a couple of advantages with this solution:

  • It allows us to calculate the differences before actually applying them, which gives us a protection mechanism to cancel the import in case an abnormal number of lines changed (never trust files coming from third-party sources!).
  • MySQL is rather efficient at batch updates since we rely on the primary key.

However some issues remain:

  • Every single line from the file has to be formatted and inserted in a table.
  • It forces us to have twice as many tables and entity / repository classes.
  • It’s still slow overall (a couple hours).

That’s when I started to ask myself what could be done differently and experiment with a bunch of other solutions.

The multidisciplinary way

My goal here was to find a better way to compute a diff.
Since MySQL wouldn’t make it easy, I decided to run a diff against the previous file instead.

After a few experiments I came up with the idea of indexing all the lines from my file into a key-hash YAML document.

Given the following file where the foo property would be our id:

{"foo": "aaa", "bar": [{"a": 1, "b": 2}]}
{"foo": "bbb", "bar": [{"y": 4, "z": 5}]}

The following YAML would be produced:

aaa: 7dc9c52dbf3e5a436ac2a40affed4d16
bbb: 288a8a3a1c2a8a370ea20b88b9b0f426
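
Here is a minimal Python sketch of that indexing step; the choice of MD5 over the raw line and the exact YAML formatting are assumptions made for illustration:

```python
import gzip
import hashlib
import json

def build_index(path):
    """Map each line's id (the `foo` property) to an MD5 hash of the raw line."""
    index = {}
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            key = json.loads(line)["foo"]
            index[key] = hashlib.md5(line.encode("utf-8")).hexdigest()
    return index

def dump_yaml(index, out_path):
    """Write the index as a flat `key: hash` YAML document, sorted for stable diffs."""
    with open(out_path, "w", encoding="utf-8") as out:
        for key in sorted(index):
            out.write(f"{key}: {index[key]}\n")
```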

I then used an open-source Go tool to run a diff between the two generated YAML files and parse the output to either insert, update or delete lines.
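
The Go tool’s output format is not reproduced here, but conceptually the diff boils down to comparing two such `{id: hash}` maps, something like:

```python
def diff_indexes(old, new):
    """Conceptual equivalent of the YAML diff between two {id: hash} maps."""
    added = [key for key in new if key not in old]
    removed = [key for key in old if key not in new]
    updated = [key for key in new if key in old and new[key] != old[key]]
    return added, updated, removed
```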

The result was much faster (down to a few minutes), but I met new problems:

  • Very memory intensive.
  • I would still have to loop through the whole original document to retrieve the data to insert or update, since the parsed diff only gives me the line id and a hash that is useless at that point.

But wait a minute. Aren’t I basically trying to compare two datasets?
Sounds like the daily routine of our data engineers.

Let’s upload our files to HDFS (Hadoop Distributed File System) and open a notebook on Zeppelin (a web-based notebook that provides a Spark interpreter) to try some stuff as a proof of concept.

Complex nested JSON dictionaries would require some more work
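
The notebook itself is not shown here; a rough PySpark equivalent of the proof of concept might look like the snippet below, where the HDFS paths, the `foo`/`bar` field names from the earlier example, and the hashing strategy are all illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lyrics-diff").getOrCreate()

# Spark reads gzipped JSON-lines files transparently; the paths are illustrative.
today = spark.read.json("hdfs:///lyrics/2019-06-02.json.gz")
yesterday = spark.read.json("hdfs:///lyrics/2019-06-01.json.gz")

def with_hash(df):
    # Hash everything except the id so rows can be compared cheaply.
    return df.select("foo", F.md5(F.to_json(F.struct("bar"))).alias("hash"))

t, y = with_hash(today), with_hash(yesterday)

added = t.join(y, "foo", "left_anti")              # in today's file, not in yesterday's
removed = y.join(t, "foo", "left_anti")            # in yesterday's file, not in today's
updated = (t.alias("t").join(y.alias("y"), "foo")  # same id, different hash
             .where(F.col("t.hash") != F.col("y.hash"))
             .select("foo"))
```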

And that’s it, we’re now down to a couple of minutes only.

Now as we were discussing at the beginning of this article, the multidisciplinary approach consists in making different disciplines work together to reach a goal.

We still have to plug everything together though, so our back-ends can communicate with our big data clusters and vice versa.

Interactions between the back-end and the big data cluster could be designed by either:

  • Using a messaging system such as Kafka (see the sketch after this list).
  • Triggering jobs on both sides using our schedulers’ APIs, and accessing the files as well as the generated diffs through a shared volume or an FTP server.
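
As a purely hypothetical sketch of the messaging option (the topic name, payload, and broker address are made up, and the kafka-python client is assumed), the big data side could signal that a diff is ready like this:

```python
import json
from kafka import KafkaProducer  # kafka-python client

# Hypothetical notification sent once the diff files are available on HDFS.
producer = KafkaProducer(
    bootstrap_servers=["kafka:9092"],
    value_serializer=lambda payload: json.dumps(payload).encode("utf-8"),
)

producer.send("lyrics-diff-ready", {
    "date": "2019-06-02",
    "added": "hdfs:///lyrics/diff/2019-06-02/added",
    "updated": "hdfs:///lyrics/diff/2019-06-02/updated",
    "removed": "hdfs:///lyrics/diff/2019-06-02/removed",
})
producer.flush()
```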

Conclusion

By using the different tools at our disposal we have been able to create a working architecture with different technologies effectively communicating with each other.

The PHP back-end is still responsible for importing the lyrics; however, the computation of which lines should be inserted, updated, or removed has been delegated to the big data cluster.

Doing so speeds up the whole ingestion process a hundredfold, reducing its runtime from multiple hours to just a couple of minutes, while using very little CPU time on our PHP back-ends, a resource we strive to preserve.

We believe such an approach can be applied to solve many problems, and we are working more and more closely with other teams, notably by building bridges between our stacks.