Bottlenecks in organizing the workflows of large HPC centers
Keywords: supercomputing, provision of computing resources, use of computing resources, workflows at supercomputer center, shared research facilities, provision of computing services
The effective output of data centers is determined by many complementary factors. Often, attention is paid to only a few of them, those that seem most significant at first glance: for example, the efficiency of the scheduler or the efficiency with which user jobs utilize resources. At the same time, a broader view of the problem is often missed. This is the level at which the interconnection of work processes in an HPC center is defined, that is, how effective work is organized as a whole. Omissions at this level can negate any subtle low-level optimizations. This paper provides a scheme for describing workflows in a supercomputer center and analyzes the experience of large HPC facilities in identifying the bottlenecks in this chain. A software implementation that makes it possible to optimize the organization of work at all stages is also proposed, in the form of a support system for the operation of an HPC site.
Copyright (c) 2023 D. A. Nikitenko
This work is licensed under a Creative Commons Attribution 4.0 International License.