Comprehensive analysis of the performance quality of large supercomputer complexes

V.V. Voevodin

doi:10.26089/NumMet.v20r317

https://doi.org/10.26089/NumMet.v20r317

A comprehensive analysis of performance quality of large supercomputer complexes

Keywords:

supercomputer

parallel computing

supercomputer applications

performance

efficiency analysis

monitoring data

Abstract

The low performance of supercomputer complexes is currently due in large part to the fact that their administrators cannot always detect and eliminate the root causes of reduced efficiency in a timely manner. This concerns not so much equipment failures (such cases can usually be detected by monitoring systems) as implicit performance degradation of certain supercomputer components that appear to continue working correctly. This situation arises because there are at present no sufficiently flexible and convenient software tools for prompt and comprehensive analysis of all the performance quality characteristics of computer systems. Existing solutions either analyze only a small part of these characteristics or are non-universal, satisfying only a small set of specific needs of the administrators of a particular system. This paper describes a systematic approach to this problem that will allow a comprehensive analysis of various aspects of supercomputer functioning, primarily those related to the execution of supercomputer applications. A software tool developed on the basis of this approach will collect, within a single model, all the most important data on the properties and quality of jobs running on the supercomputer: data on their execution performance, size and duration, the presence of specific or abnormal behavior scenarios, the usage of application packages and libraries, etc. Flexible aggregation capabilities will make it possible to specify the required level of detail: individual users, projects, application packages, subject areas, supercomputer partitions, time ranges, etc. This will allow hundreds or thousands of different views to be created for analyzing the state of the supercomputer, helping administrators choose the option that suits them best.
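The idea of a single job model with flexible aggregation can be illustrated with a minimal Python sketch. It is not the tool described in the paper: the `JobRecord` fields, the `aggregate` helper, and the sample jobs are all hypothetical names and made-up data chosen only to show how one set of job records can yield many different views by changing the grouping keys.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRecord:
    """One finished job; every field name here is a hypothetical illustration."""
    user: str
    project: str
    package: str        # application package detected for the job
    partition: str      # supercomputer partition the job ran on
    node_hours: float   # job size multiplied by duration
    cpu_load: float     # average CPU utilisation, 0.0 to 1.0
    abnormal: bool      # True if an abnormal behavior scenario was flagged


def aggregate(jobs, keys):
    """Group jobs by the given attribute names and summarise each group."""
    groups = defaultdict(list)
    for job in jobs:
        groups[tuple(getattr(job, k) for k in keys)].append(job)

    summary = {}
    for group_key, members in groups.items():
        total_hours = sum(j.node_hours for j in members)
        # Weight utilisation by node-hours so large jobs dominate the average.
        weighted_load = sum(j.cpu_load * j.node_hours for j in members)
        summary[group_key] = {
            "jobs": len(members),
            "node_hours": total_hours,
            "avg_cpu_load": weighted_load / total_hours if total_hours else 0.0,
            "abnormal_jobs": sum(j.abnormal for j in members),
        }
    return summary


# Made-up sample data, for illustration only.
JOBS = [
    JobRecord("alice", "md-sim", "GROMACS", "compute", 100.0, 0.90, False),
    JobRecord("alice", "md-sim", "GROMACS", "compute", 300.0, 0.50, True),
    JobRecord("bob", "cfd", "OpenFOAM", "gpu", 50.0, 0.20, True),
]

# The same records give different views depending on the grouping keys.
by_package = aggregate(JOBS, ["package"])
by_user_partition = aggregate(JOBS, ["user", "partition"])
```

Passing a different list of keys, such as `["project"]` or `["partition", "package"]`, produces another view without changing the underlying records, which is the sense in which a single model could support a large number of views.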

Downloads

PDF (Russian)

Published

2019-08-19

Issue

Vol. 20 (2019): Issue 3.

Section

Section 1. Numerical methods and applications

Author

V.V. Voevodin

Lomonosov Moscow State University,
Research Computing Center
• Senior Researcher
