An asynchronous sparse Cholesky solver for workstations with NUMA architecture
Authors
-
Andrei S. Maslov
-
Michail M. Makarov
-
Nikolay N. Potravkin
-
Svyatoslav O. Proskurnia
Keywords:
Cholesky factorization
NUMA architecture
asynchronous task paradigm
directed acyclic graph
hwloc library
Abstract
A parallel Cholesky factorization algorithm for sparse matrices has been implemented, based on the asynchronous task paradigm and accounting for the specifics of NUMA architecture. The execution of the numerical factorization stage and forward/backward substitution is represented as a directed acyclic graph (DAG), which removes synchronization barriers and enhances data access locality to improve the utilization efficiency of the computational device’s memory hierarchy. Performance evaluation demonstrates good scalability compared to the highly optimized commercial package Intel MKL PARDISO, confirming the effectiveness of the proposed approach.
Section
Methods and algorithms of computational mathematics and their applications
Author Biographies
Andrei S. Maslov
LLC “TS INTEGRATION”
• Developer
Michail M. Makarov
LLC “TS INTEGRATION”
• Head of Department
Nikolay N. Potravkin
LLC “TS INTEGRATION”
• Lead Developer
Svyatoslav O. Proskurnia
LLC “TS INTEGRATION”
• Lead Developer
References
- O. Schenk, K. G854rtner, W. Fichtner, and A. Stricker, “PARDISO: A High-Performance Serial and Parallel Sparse Linear Solver in Semiconductor Device Simulation,” Future Gener. Comput. Syst. 18 (1), 69-78 (2001).
doi 10.1016/S0167-739X(00)00076-5
- U. Trottenberg, C. W. Oosterlee, and A. Schüller, Multigrid (Academic Press, London, 2001).
- P. R. Amestoy, I. S. Duff, J.-Y. L’Excellent, and J. Koster, “A Fully Asynchronous Multifrontal Solver Using Distributed Dynamic Scheduling,” SIAM J. Matrix Anal. Appl. 23 (1), 15-41 (2001).
doi 10.1137/S0895479899358194
- P. Hénon, P. Ramet, and J. Roman, “PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Positive Definite Systems,” Parallel Comput. 28 (2), 301-321 (2002).
doi 10.1016/S0167-8191(01)00141-7
- M. Jacquelin, Y. Zheng, E. Ng, and K. Yelick, “An Asynchronous Task-based Fan-Both Sparse Cholesky Solver,” arXiv: 1608.00044v2 (2016).
doi 10.48550/arXiv.1608.00044
- G. Ballard, E. Carson, J. Demmel, et al., “Communication Lower Bounds and Optimal Algorithms for Numerical Linear Algebra,” Acta Numer. 23, 1-155 (2014).
doi 10.1017/S0962492914000038
- M. Bollhöfer, O. Schenk, R. Janalik, et al., “State-of-the-Art Sparse Direct Solvers,” in Parallel Algorithms in Computational Science and Engineering (Birkh854user, Cham, 2020), pp. 3-33.
doi 10.1007/978-3-030-43736-7_1
- oneAPI Threading Building Blocks (oneTBB).
https://www.threadingbuildingblocks.org/.
- O. Schenk and K. G854rtner, “Two-Level Dynamic Scheduling in PARDISO: Improved Scalability on Shared Memory Multiprocessing Systems,” Parallel Comput. 28 (2), 187-197 (2002).
doi 10.1016/S0167-8191(01)00135-1
- G. Karypis and V. Kumar, “A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs,” SIAM J. Sci. Comput. 20 (1), 359-392 (1998).
doi 10.1137/S1064827595287997
- C. Chevalier and F. Pellegrini, “PT-Scotch: A Tool for Efficient Parallel Graph Ordering,” Parallel Comput. 34 (6-8), 318-331 (2008).
doi 10.1016/j.parco.2007.12.001
- J. W. H. Liu, “The Role of Elimination Trees in Sparse Factorization,” SIAM J. Matrix Anal. Appl. 11 (1), 134-172 (1990).
doi 10.1137/0611010
- C. C. Ashcraft, A Taxonomy of Column-Based Cholesky Factorizations , PhD Thesis in Mathematics (Yale University, New Haven, US, 1996).
https://dl.acm.org/doi/book/10.5555/241365.
- F. Broquedis, J. Clet-Ortega, S. Moreaud, et al., “hwloc: A Generic Framework for Managing Hardware Affinities in HPC Applications,” 2010 18th Euromicro Conference on Parallel, Distributed and Network-based Processing (2010).
doi 10.1109/PDP.2010.67
- T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,” ACM Trans. Math. Softw. 38 (1), Article No 1, 1-25 (2011).
doi 10.1145/2049662.2049663