New approaches for automatic analysis of HPC application performance using the TASC software suite
Authors
- Vladimir A. Matveev
- Alexander V. Setyaev
- Vadim V. Voevodin
Keywords:
supercomputer
monitoring
performance analysis
HPC applications
supercomputer usage quality
application class
efficiency assessment
Abstract
This paper presents new automatic analysis approaches for identifying performance issues and useful properties of jobs running on a supercomputer. Methods are proposed for detecting problematic application classes, such as “hung” programs and jobs with underutilized nodes. New assessments for the automatic preliminary evaluation of GPU processor and memory usage efficiency are also developed and tested. These approaches extend the functionality of the existing TASC software suite, which is designed for comprehensive analysis of the usage quality of modern supercomputers.
Section
Parallel software tools and technologies