Shared memory based MPI Reduce and Bcast algorithms


  • Alexey A. Romanyuta
  • Mikhail G. Kurnosov


collective operations
computer systems


Algorithms for implementing collective operations MPI_Bcast, MPI_Reduce, MPI_Allreduce using shared memory of multiprocessor servers are proposed. The algorithms create a shared memory segment and a system of queues in it, through which message blocks are transmitted. The software implementation is based on the Open MPI library as an isolated coll/sharm component. Unlike existing algorithms, interaction with the queuing system is organized with spinlock and focused on reducing the number of barrier synchronizations and atomic operations. When conducting experiments on a server with x86–64 architecture for the MPI_Bcast operation, the largest reduction in time was obtained by 6.5 times (85% less) and MPI_Reduce by 3.3 times (70% less) compared to the implementation in the coll/tuned component of the Open MPI library. Recommendations on the use of algorithms for different message sizes are suggested.





Parallel software tools and technologies

Author Biographies

Alexey A. Romanyuta

Mikhail G. Kurnosov


  1. MPI: A Message-Passing Interface Standard. Version 4.0. . Cited September 21, 2023.
  2. S. Jain, R. Kaleem, M. G. Balmana, et al., “Framework for Scalable Intra-Node Collective Operations Using Shared Memory,” in Proc. Int. Conf. on High Performance Computing, Networking, Storage, and Analysis, Dallas, USA, November 11-16, 2018 (IEEE Press, Piscataway, 2018), Vol. 1, pp. 374-385.
    doi  10.1109/SC.2018.00032
  3. J. S. Ladd, M. G. Venkata, P. Shamis, and R. L. Graham, “Collective Framework and Performance Optimizations to Open MPI for Cray XT Platforms,” in Proc. 53rd Cray User Group Meeting, Fairbanks, Alaska, USA, May 23-26, 2011. . Cited September 22, 2023.
  4. High-Performance Portable MPI. September 22, 2023.
  5. Cross Memory Attach. September 22, 2023.
  6. High-Performance Intra-Node MPI Communication. . Cited September 22, 2023.
  7. B. Goglin and S. Moreaud, “KNEM: a Generic and Scalable Kernel-Assisted Intra-Node MPI Communication Framework,” J. Parallel Distrib. Comput. 73 (2), 176-188 (2013).
    doi 10.1016/j.jpdc.2012.09.016
  8. Linux Cross-Memory Attach. . Cited September 22, 2023.
  9. Open Source High Performance Computing. . Cited September 22, 2023.
  10. R. L. Graham and G. Shipman, “MPI Support for Multi-Core Architectures: Optimized Shared Memory Collectives,” in Lecture Notes in Computer Science (Springer, Heidelberg, 2008), Vol. 5205, pp. 130-140.
    doi 10.1007/978-3-540-87475-1_21.
  11. MVAPICH: MPI over InfiniBand, Omni-Path, Ethernet/iWARP, RoCE, and Slingshot. Cited September 22, 2023.
  12. M. Kurnosov and E. Tokmasheva, “Shared Memory Based MPI Broadcast Algorithms for NUMA Systems,” in Communications in Computer and Information Science (Springer, Cham, 2020), Vol. 1331, pp. 473-485.
    doi 10.1007/978-3-030-64616-5_41
  13. S. Li, T. Hoefler, and M. Snir, “NUMA-Aware Shared-Memory Collective Communication for MPI,” in Proc. 22nd Int. Symposium on High-Performance Parallel and Distributed Computing, New York, USA, June 17-21, 2013 (ACM Press, New York, 2013), pp. 85-96.
    doi 10.1145/2462902.2462903
  14. M. G. Kurnosov, “Analysis of the Scalability of Collective Exchange Algorithms on Distributed Computing Systems,” in Proc. 4th All-Russian Scientific and Technical Conf. on Supercomputer Technologies, Rostov-on-Don, Russia, September 19-24, 2016 (Southern Federal Univ. Press, Rostov-on-Don, 2016), Vol. 2, pp. 48-52 [in Russian].