A comprehensive analysis of performance quality of large supercomputer complexes
Keywords: supercomputer, parallel computing, supercomputer applications, performance, efficiency analysis, monitoring data
The low performance of supercomputer complexes is largely due to the fact that their administrators cannot always detect and eliminate the root causes of reduced efficiency in a timely manner. This concerns not so much equipment failures (which can usually be detected by monitoring systems) as an implicit performance decrease in certain supercomputer components that otherwise seem to continue working correctly. The situation arises because there are currently no sufficiently flexible and convenient software tools for prompt and comprehensive analysis of all the performance quality characteristics of computer systems. Existing solutions either cover only a small part of these characteristics or are non-universal tools that satisfy only a narrow set of needs specified by the administrators of a particular system. This paper describes a systematic approach to this problem that allows one to perform a comprehensive analysis of various aspects of supercomputer functioning, primarily those related to the execution of supercomputer applications. A software tool developed on the basis of this approach will collect, within a single model, the most important data on the properties and quality of jobs running on the supercomputer: their execution performance, size and duration, the presence of specific or abnormal behavior scenarios, the usage of application packages and libraries, etc. Flexible aggregation capabilities will make it possible to specify the required level of detail: individual users, projects, application packages, subject areas, supercomputer partitions, time ranges, etc. This will allow one to create hundreds or thousands of different views for analyzing the state of the supercomputer, from which administrators can choose the most suitable ones.
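The idea of keeping job data in a single model and aggregating it at a chosen level of detail can be illustrated with a minimal Python sketch. It is not the tool described in the paper: the `Job` record, its field names, the `aggregate` helper and the sample jobs are all hypothetical and serve only to show how one set of job records might yield different views.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Job:
    # Hypothetical job record; the field names are assumptions for illustration.
    user: str
    project: str
    partition: str
    package: str
    node_hours: float
    cpu_load: float   # average CPU utilization in the range 0..1
    abnormal: bool    # True if an abnormal behavior scenario was detected


def aggregate(jobs, dimension):
    """Group jobs by one attribute and summarize each group."""
    groups = defaultdict(list)
    for job in jobs:
        groups[getattr(job, dimension)].append(job)
    return {
        key: {
            "jobs": len(members),
            "node_hours": sum(j.node_hours for j in members),
            "mean_cpu_load": mean(j.cpu_load for j in members),
            "abnormal_share": sum(j.abnormal for j in members) / len(members),
        }
        for key, members in groups.items()
    }


# Invented sample data, not measurements from any real system.
JOBS = [
    Job("alice", "bio", "compute", "GROMACS", 120.0, 0.80, False),
    Job("alice", "bio", "compute", "GROMACS", 40.0, 0.20, True),
    Job("bob", "cfd", "gpu", "OpenFOAM", 200.0, 0.60, False),
]

# The same records viewed at two different levels of detail.
by_user = aggregate(JOBS, "user")
by_partition = aggregate(JOBS, "partition")
```

Changing the `dimension` argument (for example to `"project"` or `"package"`) produces another view from the same records, which is the sense in which a single model could support many different analyses.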