DOI: https://doi.org/10.26089/NumMet.v24r103

Development of a portable software solution for monitoring and analyzing the performance of supercomputer applications

Authors

  • Vadim V. Voevodin
  • Konstantin S. Stefanov

Keywords:

parallel computing
supercomputer
monitoring
data analysis
performance
portability

Abstract

Modern supercomputers are used in many areas of science and technology, yet their computational resources are often not fully utilized. A frequent cause is the low efficiency of user applications. This problem is very difficult to solve, both because the structure of modern supercomputers is extremely complex and because users of computing systems lack the theoretical knowledge and practical experience needed to create highly efficient parallel applications. Moreover, users are often unaware that their applications run inefficiently. It is therefore important for supercomputer administrators to be able to continuously monitor and analyze the entire flow of running applications. Various existing performance monitoring and analysis systems can be used for this purpose; however, most of them either do not provide sufficient functionality for studying performance or are not portable. This paper describes a prototype of a software package under development that offers extensive capabilities for collecting and automatically analyzing application performance data while remaining portable.


Published

2023-01-31

Section

Parallel software tools and technologies

