Why message queue size matters when programming with Actors?

Recently, I had to wrangle data from a 50GB file containing close to 34 million lines. This wrangling involved selecting lines based on some pattern and then pulling data from the selected lines.

The naive solution would have been to sequentially process each line. Since I had a multi-core laptop, I decided to make it work for me by scripting a parallel solution in Groovy using GPars library.

After some digging around, I decided to use Actor model support in GPars. The basic structure of the solution was as follows.

  1. Create a collating actor that collates data from processing actors.
  2. Create N processing actors that will process a given line — search for the pattern, pull the data from the line, and send the data to the collating actor.
  3. Read each line from the file and asynchronously send it to one of the processing actors. (File reading thread)

Skimming thru GPars documentation, I got to a working script in about an hour. It ran pretty well on a small slice (~3 million lines) of the original file.

So, I ran the script on the original file. Interestingly (or as with most software effort), the execution consumed all of 8GB of heap space and ran for a while before completion. While it did the job, I was intrigued with the amount of consumed memory. The above solution was processing the lines without storing them in memory and the data pulled from the lines amounted to a few thousand words that were stored in a set container (no duplicates). So, as a good software developer, I went down the rabbit hole :)

After spending a few hours digging into GPars documentation, profiling the script with VisualVM, whipping up a parallel solution in Python (which helped me understand an important aspect of Futures in Python), and making noise about a “bug” in GPars, I figured out the issue (err, my bad).

I had not considered the size of the communication channel (queue) between the consumers (processing actors) and the producer (file reading thread).

So, the queues were acting likes buffers and the program was storing the lines in memory!! Arghhh!!!

Looking back, every programming model has some “tricky” aspect. If synchronization via locks and semaphores is such an aspect of thread-based programming, then getting the queue size right is such an aspect of actor-based programming. That said, programming with Actors is a way easier than programming with Threads. Everyone should try it :)

For them curious cats, the final solution is available as a gist.

Written by

Programming, experimenting, writing | Past: SWE, Researcher, Professor | Present: SWE

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store