When threading library can help perform better parallel map operations in Python
After stumbling on a “bug” (which was my gaffe) in GPars library while scripting a solution in Groovy, I decided to code up the same solution in Python using concurrent.futures (and multiprocessing) module.
The basic steps in my solution were as follows.
- Read lines from a huge file, i.e., 50GB / 34 million lines.
- Apply a function to each line.
- Collate the output of above function applications.
Clearly, each step could be set up a stage in a pipeline where each stage executed in parallel (along with intra-stage parallelism). And, off I went to implement this design.
Take 1: With concurrent.futures, the obvious solution was to do a parallel map operation on the lines from the file using a pool-based executor and then reduce the results of the map operation. I scripted this solution and found the script running out of memory before the parallel map operation in step 2 completed. The obvious reason for this behavior was the synchronous nature of parallel map operation.
Executor.map() operation does not return until jobs/tasks are created to process every data item (in this case, lines of the file).
Consequently, the map operation results were being stored away in memory until the map operation completed.
Take 2: While this issue could be addressed by using an asynchronous map() operation, concurrent.futures module did not support asynchronous map() operation.
So, after some poking around, I figured that the main thread could spawn a (job creating) thread to perform step 2 — submit jobs to a pool-based executor — and then concurrently perform step 3. This change also required setting up a synchronized queue between the job creating thread and the main thread. The created jobs were passed to the main thread via this queue and the main thread processed the results of these jobs. I implemented this change using the threading module of Python and it worked like a charm on a small slice (~3 million lines) of the original file!!
Take 3: When the solution was applied to the original file, it consumed more than 8GB of memory. This was rather high for a solution that merely processes few lines at a time. After some thought, I figured the size of the queue was unconstrained and, hence, it was acting like a buffer that stored large chunks of the file. The fix was simple — limit the size of the queue. With the fix in place, the solution worked consuming less than 4GB of memory.
Looking back, this experience was interesting as it exposed the limitations of high-level abstractions for parallel programming (e.g., parallel map) and illustrated how low-level abstractions (e.g., threads) can help address limitations of high-level abstractions.
In an ideal Python world, the map operation (step 2) would return immediately with a “lazy/async iterator” that would block during iteration (step 3). However, we are not there yet (or, at least, I haven’t found my way there yet).
Until we/I get there, it is best to be familiar with all levels of abstractions for parallel programming in Python.
For them curious cats, the final solution is available as a gist.