Automatic detection and description of supercomputer network infrastructure

Authors

  • Vad.V. Voevodin
  • K.S. Stefanov

Keywords

supercomputers
parallel computing
supercomputer topology
communication networks
network topology detection
Ethernet
Infiniband
SNMP protocol

Abstract

Supercomputer performance increases every year. This growth comes, in particular, from larger numbers of compute nodes and from increasingly complex memory subsystems and communication networks, which in turn reduces system reliability and efficiency. As a result, online monitoring and efficient autonomous operation of supercomputer complexes are becoming more and more important. To address this problem, the Research Computing Center of Moscow University is developing the Octotron system, whose main objective is to ensure maximum safety and the fullest possible use of the hardware. Octotron relies on a model of the computer system that should contain the main supercomputer components and their interconnections. In particular, this model should include a description of the communication networks. Such a description can be highly complex in many cases, so its construction needs to be automated. In this paper we describe a software tool under development that detects Ethernet and Infiniband network topology in supercomputer systems. To detect Ethernet topology, the tool collects SNMP data from all available nodes and refines them using the proposed rules to obtain more precise results. For Infiniband networks, it collects the necessary information from the subnet manager. The results of using this tool on the Lomonosov and Chebyshev supercomputers installed at Moscow University are discussed.
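To make the idea of an automatically generated network description more concrete, here is a minimal Python sketch. It is not the tool described in the paper and does not reproduce its rules. It assumes that per-device neighbor records have already been gathered (for example via SNMP) and stored as plain dictionaries; the record layout, the device names, and the functions `build_links` and `to_dot` are all hypothetical. The sketch merges the two one-sided reports of each cable into a single undirected link and prints the result as a Graphviz DOT graph.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical record layout (an assumption, not the paper's data format):
# device name -> list of (local_port, remote_device, remote_port).
Neighbor = Tuple[str, str, str]
Endpoint = Tuple[str, str]  # (device, port)


def build_links(neighbors: Dict[str, List[Neighbor]]) -> Set[FrozenSet[Endpoint]]:
    """Merge one-sided neighbor reports into a set of undirected links."""
    links: Set[FrozenSet[Endpoint]] = set()
    for device, records in neighbors.items():
        for local_port, remote_device, remote_port in records:
            link = frozenset({(device, local_port), (remote_device, remote_port)})
            # Skip degenerate records where both endpoints coincide.
            if len(link) == 2:
                links.add(link)
    return links


def to_dot(links: Set[FrozenSet[Endpoint]]) -> str:
    """Render the links as an undirected graph in Graphviz DOT syntax."""
    lines = ["graph topology {"]
    # Sort endpoints and links so the output is deterministic.
    for (dev_a, port_a), (dev_b, port_b) in sorted(tuple(sorted(l)) for l in links):
        lines.append(f'  "{dev_a}" -- "{dev_b}" [label="{port_a} <-> {port_b}"];')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sample data: two switches and one compute node.
    sample = {
        "sw1": [("1", "sw2", "24"), ("2", "node01", "eth0")],
        "sw2": [("24", "sw1", "1")],
    }
    print(to_dot(build_links(sample)))
```

Storing each link as an unordered pair of endpoints means that a cable reported by both of its ends is counted once. Real SNMP data is typically incomplete or inconsistent, which is why a practical tool needs additional correction rules of the kind the abstract mentions.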


Published

2014-09-23

Issue

Section

Section 1. Numerical methods and applications

