Shared memory based MPI Reduce and Bcast algorithms
Authors
- Alexey A. Romanyuta
- Mikhail G. Kurnosov
Keywords:
Bcast
Reduce
Allreduce
collective operations
MPI
computer systems
Abstract
Algorithms for implementing the collective operations MPI_Bcast, MPI_Reduce, and MPI_Allreduce using the shared memory of multiprocessor servers are proposed. The algorithms create a shared memory segment with a system of queues in it, through which message blocks are transmitted. The software implementation is built on the Open MPI library as a standalone coll/sharm component. Unlike existing algorithms, interaction with the queue system is organized using spinlocks and is aimed at reducing the number of barrier synchronizations and atomic operations. In experiments on a server with the x86-64 architecture, the largest reduction in execution time was 6.5 times (85% less) for MPI_Bcast and 3.3 times (70% less) for MPI_Reduce compared to the implementation in the coll/tuned component of the Open MPI library. Recommendations on the use of the algorithms for different message sizes are given.
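To illustrate the general idea of passing message blocks through a shared-memory segment with spin-waiting, a minimal C sketch of a single-slot intra-node broadcast is given below. This is not the coll/sharm component itself: the slot layout, block size, and names such as shm_bcast_sketch and slot_t are assumptions made for this example; only the MPI calls (MPI_Comm_split_type, MPI_Win_allocate_shared, MPI_Win_shared_query) are standard API.

/*
 * Illustrative sketch only: a single-slot shared-memory broadcast with
 * spin-waiting, in the spirit of the queue-based scheme described in the
 * abstract. It is NOT the coll/sharm component; the block size, flag
 * layout, and function names are assumptions made for this example.
 */
#include <mpi.h>
#include <stdatomic.h>
#include <string.h>
#include <stdio.h>

#define BLOCK_SIZE 4096               /* assumed block (queue slot) size */

typedef struct {
    atomic_int seq;                   /* root increments after filling the slot */
    atomic_int acks;                  /* readers increment after copying a block */
    char       data[BLOCK_SIZE];      /* payload of the current block */
} slot_t;

/* Broadcast `count` bytes from `root` of `node_comm` through one shared slot. */
static void shm_bcast_sketch(void *buf, int count, int root,
                             MPI_Comm node_comm, slot_t *slot)
{
    int rank, size;
    MPI_Comm_rank(node_comm, &rank);
    MPI_Comm_size(node_comm, &size);

    int nblocks = (count + BLOCK_SIZE - 1) / BLOCK_SIZE;
    for (int b = 0; b < nblocks; b++) {
        int len = (b == nblocks - 1) ? count - b * BLOCK_SIZE : BLOCK_SIZE;
        if (rank == root) {
            /* Wait until every reader has consumed the previous block. */
            while (atomic_load(&slot->acks) != b * (size - 1))
                ;                                     /* spin */
            memcpy(slot->data, (char *)buf + b * BLOCK_SIZE, len);
            atomic_store(&slot->seq, b + 1);          /* publish the block */
        } else {
            while (atomic_load(&slot->seq) != b + 1)
                ;                                     /* spin until published */
            memcpy((char *)buf + b * BLOCK_SIZE, slot->data, len);
            atomic_fetch_add(&slot->acks, 1);         /* acknowledge the copy */
        }
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Group the ranks that share a node and allocate one shared segment. */
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int rank;
    MPI_Comm_rank(node_comm, &rank);

    MPI_Win win;
    slot_t *slot;
    MPI_Aint sz = (rank == 0) ? sizeof(slot_t) : 0;
    MPI_Win_allocate_shared(sz, 1, MPI_INFO_NULL, node_comm, &slot, &win);
    if (rank != 0) {
        int disp;
        MPI_Win_shared_query(win, 0, &sz, &disp, &slot);  /* locate rank 0's slot */
    }
    if (rank == 0) {
        atomic_store(&slot->seq, 0);
        atomic_store(&slot->acks, 0);
    }
    MPI_Barrier(node_comm);            /* one-time setup synchronization */

    char msg[8192];
    if (rank == 0) snprintf(msg, sizeof(msg), "hello from the shared slot");
    shm_bcast_sketch(msg, sizeof(msg), 0, node_comm, slot);
    printf("rank %d got: %s\n", rank, msg);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

The sketch can be built and run with, e.g., mpicc shm_bcast_sketch.c -o shm_bcast_sketch and mpirun -np 4 ./shm_bcast_sketch. Synchronization here relies on spin-waiting over atomic counters rather than per-block barriers, which reflects the intent stated in the abstract; the actual component uses a queue system with spinlocks and multiple slots.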
Section
Parallel software tools and technologies
References
- MPI: A Message-Passing Interface Standard. Version 4.0.
http://www.mpi-forum.org/docs/mpi-4.0/mpi40-report.pdf. Cited September 21, 2023.
- S. Jain, R. Kaleem, M. G. Balmana, et al., “Framework for Scalable Intra-Node Collective Operations Using Shared Memory,” in Proc. Int. Conf. on High Performance Computing, Networking, Storage, and Analysis, Dallas, USA, November 11-16, 2018 (IEEE Press, Piscataway, 2018), Vol. 1, pp. 374-385.
doi 10.1109/SC.2018.00032
- J. S. Ladd, M. G. Venkata, P. Shamis, and R. L. Graham, “Collective Framework and Performance Optimizations to Open MPI for Cray XT Platforms,” in Proc. 53rd Cray User Group Meeting, Fairbanks, Alaska, USA, May 23-26, 2011.
https://www.ornl.gov/publication/collective-framework-and-performance-optimizations-open-mpi-cray-xt-platforms. Cited September 22, 2023.
- High-Performance Portable MPI.
https://www.mpich.org/. Cited September 22, 2023.
- Cross Memory Attach.
https://lwn.net/Articles/405284/. Cited September 22, 2023.
- High-Performance Intra-Node MPI Communication.
https://knem.gitlabpages.inria.fr. Cited September 22, 2023.
- B. Goglin and S. Moreaud, “KNEM: a Generic and Scalable Kernel-Assisted Intra-Node MPI Communication Framework,” J. Parallel Distrib. Comput. 73 (2), 176-188 (2013).
doi 10.1016/j.jpdc.2012.09.016
- Linux Cross-Memory Attach.
https://github.com/hjelmn/xpmem. Cited September 22, 2023.
- Open Source High Performance Computing.
http://www.open-mpi.org. Cited September 22, 2023.
- R. L. Graham and G. Shipman, “MPI Support for Multi-Core Architectures: Optimized Shared Memory Collectives,” in Lecture Notes in Computer Science (Springer, Heidelberg, 2008), Vol. 5205, pp. 130-140.
doi 10.1007/978-3-540-87475-1_21
- MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, RoCE, and Slingshot.
https://mvapich.cse.ohio-state.edu/. Cited September 22, 2023.
- M. Kurnosov and E. Tokmasheva, “Shared Memory Based MPI Broadcast Algorithms for NUMA Systems,” in Communications in Computer and Information Science (Springer, Cham, 2020), Vol. 1331, pp. 473-485.
doi 10.1007/978-3-030-64616-5_41
- S. Li, T. Hoefler, and M. Snir, “NUMA-Aware Shared-Memory Collective Communication for MPI,” in Proc. 22nd Int. Symposium on High-Performance Parallel and Distributed Computing, New York, USA, June 17-21, 2013 (ACM Press, New York, 2013), pp. 85-96.
doi 10.1145/2462902.2462903
- M. G. Kurnosov, “Analysis of the Scalability of Collective Exchange Algorithms on Distributed Computing Systems,” in Proc. 4th All-Russian Scientific and Technical Conf. on Supercomputer Technologies, Rostov-on-Don, Russia, September 19-24, 2016 (Southern Federal Univ. Press, Rostov-on-Don, 2016), Vol. 2, pp. 48-52 [in Russian].