Performance of map operation in mpi4py and multiprocessing modules
Recently, while reading about MPI, I stumbled on the mpi4py library, which brings MPI to Python. Given my past familiarity with Python’s multiprocessing module, I was curious how the performance of mpi4py would compare with that of the multiprocessing module. So, I performed a small experiment.
Computing Julia Set
Since mpi4py is based on MPI, it exposes many MPI-based features that are not readily offered by multiprocessing. Also, via mpi4py.futures, mpi4py offers features such as map and starmap that are similar to those provided by multiprocessing. So, I decided to compare the performance of these modules based on the map function.
The first example in the mpi4py documentation was an ideal candidate for this purpose for two reasons. First, the example computes a 640x480-pixel image of the Julia set and writes it to a PGM file, which is a highly parallel computation. Second, the code was structured such that replacing mpi4py with multiprocessing required changing only three lines.
The code derived from this example and modified to work with mpi4py (version 3.0.2) and multiprocessing (Python v3.7.5) is available here.
Both the mpi4py and multiprocessing implementations were configured to use a maximum of 8 processes. To check how the scale of computation affected the performance of these libraries, the implementations calculated the Julia set for an increasing number of pixels. The numbers of x and y pixels were scaled together by factors from 1 through 10, i.e., (640*1)x(480*1), (640*2)x(480*2), …, (640*10)x(480*10).
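To make the setup concrete, here is a minimal sketch of what the multiprocessing variant of such an experiment could look like. It is not the code linked above: the function names, the coordinate bounds, the constant C, and the reduced base image size are assumptions chosen for illustration, loosely modeled on the structure of the mpi4py documentation example.

```python
import time
from functools import partial
from multiprocessing import Pool

# Assumed region of the complex plane and Julia constant (illustrative values).
X0, X1 = -2.0, 2.0
Y0, Y1 = -1.5, 1.5
C = complex(0, 0.65)


def julia(x, y):
    """Return an escape-time value in [1, 255] for the point x + yi."""
    z = complex(x, y)
    n = 255
    while abs(z) < 3 and n > 1:
        z = z * z + C
        n -= 1
    return n


def julia_line(k, width, height):
    """Compute row k of a width x height image as a bytearray."""
    dx = (X1 - X0) / width
    dy = (Y1 - Y0) / height
    y = Y1 - k * dy
    return bytearray(julia(X0 + j * dx, y) for j in range(width))


def render(width, height, processes):
    """Map julia_line over all rows using a pool of worker processes."""
    row = partial(julia_line, width=width, height=height)
    with Pool(processes=processes) as pool:
        return pool.map(row, range(height))


if __name__ == "__main__":
    # The post uses a 640x480 base image and scales 1 to 10;
    # smaller values are used here so the demo finishes quickly.
    base_w, base_h = 64, 48
    for scale in range(1, 4):
        start = time.perf_counter()
        image = render(base_w * scale, base_h * scale, processes=8)
        elapsed = time.perf_counter() - start
        print(f"Scale {scale}, Time {elapsed:.6f}s")
```

In an mpi4py variant, the Pool would presumably be replaced by MPIPoolExecutor from mpi4py.futures, whose map method is called in a similar way, which is why so few lines need to change between the two implementations.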
Following are the wall-clock times (in seconds) taken by the mpi4py implementation at different scaling factors when executed via the following command, with the host configured with 9 slots (one master process plus 8 workers):
mpiexec --hostfile hostfile -n 9 python3.7 -m mpi4py.futures mpi.py
Scale 1, Time 1.183206s
Scale 2, Time 4.499933s
Scale 3, Time 9.698865s
Scale 4, Time 16.883051s
Scale 5, Time 26.412981s
Scale 6, Time 37.741278s
Scale 7, Time 51.558758s
Scale 8, Time 66.574297s
Scale 9, Time 84.373283s
Scale 10, Time 104.416280s
Following are the wall-clock times (in seconds) taken by the multiprocessing implementation at different scaling factors when executed via the following command:
python3.7 mp.py
Scale 1, Time 1.217537s
Scale 2, Time 4.315503s
Scale 3, Time 9.436350s
Scale 4, Time 16.552178s
Scale 5, Time 25.867514s
Scale 6, Time 36.982336s
Scale 7, Time 50.411781s
Scale 8, Time 65.482895s
Scale 9, Time 83.035602s
Scale 10, Time 102.115165s
These numbers are from a single execution; the timings were consistent across multiple repetitions.
Are the running times significantly different?
Up to the scaling factor of 5, the difference in running time between the two implementations is at most about half a second. Beyond the scaling factor of 5, however, the multiprocessing implementation starts to perform better than the mpi4py implementation, e.g., at scaling factors 6, 8, and 10, the difference in running times is close to 0.75, 1.1, and 2.3 seconds, respectively.
So, depending on the scale of computation, multiprocessing may be more performant than mpi4py.
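The differences quoted above can be rechecked with a few lines of Python. This is only a small sketch: the two lists are copied from the 8-process timings listed earlier, and the helper name is made up for illustration.

```python
# Wall-clock times (seconds) copied from the two 8-process listings above.
mpi_8 = [1.183206, 4.499933, 9.698865, 16.883051, 26.412981,
         37.741278, 51.558758, 66.574297, 84.373283, 104.416280]
mp_8 = [1.217537, 4.315503, 9.436350, 16.552178, 25.867514,
        36.982336, 50.411781, 65.482895, 83.035602, 102.115165]


def differences(a, b):
    """Element-wise difference a[i] - b[i] for two equally long lists."""
    return [x - y for x, y in zip(a, b)]


diffs = differences(mpi_8, mp_8)
for scale, (d, base) in enumerate(zip(diffs, mp_8), start=1):
    # Positive values mean mpi4py was slower than multiprocessing.
    print(f"Scale {scale}: {d:+.3f}s ({100 * d / base:+.1f}%)")
```

Looking at the relative differences alongside the absolute ones helps put the gap in perspective, since the absolute gap grows together with the total running time.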
Does the number of processes affect the difference in performance?
To answer this question, I reran the experiment with the maximum number of processes pegged to 4 by providing 4 as the max_workers argument to MPIPoolExecutor in the mpi4py implementation and as the processes argument to Pool in the multiprocessing implementation.
Following are the running times for the mpi4py implementation when executed via the following command, with the host configured with 5 slots (one master process plus 4 workers):
mpiexec --hostfile hostfile -n 5 python3.7 -m mpi4py.futures mpi.py
Scale 1, Time 1.593766s
Scale 2, Time 5.825521s
Scale 3, Time 12.060571s
Scale 4, Time 20.631747s
Scale 5, Time 31.711070s
Scale 6, Time 44.379169s
Scale 7, Time 59.009832s
Scale 8, Time 77.066597s
Scale 9, Time 94.842155s
Scale 10, Time 117.034372s
Following are the running times for the multiprocessing implementation with 4 processes.
Scale 1, Time 1.315367s
Scale 2, Time 4.512450s
Scale 3, Time 9.426444s
Scale 4, Time 16.534967s
Scale 5, Time 26.249483s
Scale 6, Time 37.269908s
Scale 7, Time 49.685339s
Scale 8, Time 64.522356s
Scale 9, Time 82.165439s
Scale 10, Time 101.421775s
In this case, the running time of the mpi4py implementation increases at every scaling factor compared to the 8-process runs, from roughly 35% at scale 1 down to roughly 12% at scale 10. A possible reason for this could be increased interprocess communication per worker. Specifically, with a fixed chunk size, the number of chunks stays the same, so as the number of processes decreases, each worker has to exchange more messages with the master.
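The relative change between the 8-process and 4-process runs can be computed directly from the timings listed above. The sketch below does this for both implementations; the list and function names are made up for illustration.

```python
# Wall-clock times (seconds) copied from the listings above.
mpi_8 = [1.183206, 4.499933, 9.698865, 16.883051, 26.412981,
         37.741278, 51.558758, 66.574297, 84.373283, 104.416280]
mpi_4 = [1.593766, 5.825521, 12.060571, 20.631747, 31.711070,
         44.379169, 59.009832, 77.066597, 94.842155, 117.034372]
mp_8 = [1.217537, 4.315503, 9.436350, 16.552178, 25.867514,
        36.982336, 50.411781, 65.482895, 83.035602, 102.115165]
mp_4 = [1.315367, 4.512450, 9.426444, 16.534967, 26.249483,
        37.269908, 49.685339, 64.522356, 82.165439, 101.421775]


def percent_change(new, old):
    """Percentage change from old[i] to new[i] for each scaling factor."""
    return [100.0 * (n - o) / o for n, o in zip(new, old)]


mpi_change = percent_change(mpi_4, mpi_8)
mp_change = percent_change(mp_4, mp_8)
for scale, (a, b) in enumerate(zip(mpi_change, mp_change), start=1):
    print(f"Scale {scale}: mpi4py {a:+.1f}%, multiprocessing {b:+.1f}%")
```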
In comparison, the running time of the multiprocessing implementation surprisingly remains almost unchanged (within about 2% from scale 3 onward) despite the reduction in the number of processes. A possible reason for this could be that the multiprocessing module determines the chunk size based on the data and the number of processes, which may reduce the communication overhead. [Related post]
This simple experiment considered the task of local parallel processing of data. In this context, off the shelf, the multiprocessing module performed better than mpi4py.
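To illustrate how chunk size could influence the number of master-worker exchanges, the sketch below mirrors the default chunk size heuristic that, to my understanding, CPython's Pool.map applies when no chunksize is given (roughly the number of items divided by four times the number of workers, rounded up). It assumes that the work items are the 480 image rows at scale 1 and uses a fixed chunk size of 1 as the point of comparison; both are assumptions for illustration, and the helper names are hypothetical.

```python
import math


def pool_map_default_chunksize(n_items, n_workers):
    """Assumed mirror of Pool.map's default: ceil(n_items / (4 * n_workers))."""
    chunksize, extra = divmod(n_items, n_workers * 4)
    if extra:
        chunksize += 1
    return chunksize


def n_chunks(n_items, chunksize):
    """Number of chunks (task messages) needed to cover n_items."""
    return math.ceil(n_items / chunksize)


rows = 480  # image rows at scale 1, assuming one work item per row
for workers in (8, 4):
    cs = pool_map_default_chunksize(rows, workers)
    print(f"{workers} workers: chunksize {cs}, {n_chunks(rows, cs)} chunks")

# With a fixed chunk size of 1, every row is its own task,
# regardless of the number of workers.
print(f"fixed chunksize 1: {n_chunks(rows, 1)} chunks")
```

Under these assumptions, an adaptive chunk size sends far fewer, larger batches than a fixed chunk size of 1, which is consistent with the possible explanation given above. Passing an explicit chunksize to either map call would be one way to test this hypothesis.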