DOI: https://doi.org/10.26089/NumMet.v24r101

Bottlenecks in organizing the workflows of large HPC centers

Authors

  • Dmitry A. Nikitenko

Keywords:

supercomputing
provision of computing resources
use of computing resources
workflows at supercomputer center
shared research facilities
provision of computing services

Abstract

The effective output of a data center is determined by many complementary factors. Attention is often paid to only the few that seem most significant at first glance, such as the efficiency of the scheduler or the efficiency with which user tasks utilize resources. A more general view of the problem is often missed: the level at which the interconnection of work processes in the HPC center is determined and the work as a whole is organized. Omissions at this level can negate any subtle low-level optimizations. This paper provides a scheme for describing workflows in a supercomputer center and analyzes the experience of large HPC facilities in identifying the bottlenecks in this chain. It also proposes a software implementation, in the form of a support system for the functioning of an HPC site, that makes it possible to optimize the organization of work at all stages.


Published

2023-01-18

Section

Parallel software tools and technologies

