DOI: https://doi.org/10.26089/NumMet.v23r102

A parallel program model for execution time estimation

Authors

  • Valery А. Antonyuk
  • Nikita G. Mikheev

Keywords:

parallel programs
CUDA
OpenCL
max-plus algebra

Abstract

Programs for general-purpose graphics processing units represented as kernels without indefinite loops are considered in this paper. Such kernels can be implemented by CUDA or OpenCL technologies, for example. For execution time estimation, various models of program execution are introduced: from very “naive” to more reliable. All models are presented in the form of matrix expressions in max-plus algebra.


Published

2022-02-06

Issue

Section

Parallel software tools and technologies

Author Biographies

Valery А. Antonyuk

Nikita G. Mikheev


References

  1. J. Szuppe, “Boost.Compute: A Parallel Computing Library for C++ Based on OpenCL,” in Proc. 4th Int. Workshop on OpenCL, Vienna, Austria, April 19-21, 2016 (ACM Press, New York, 2016),
    doi 10.1145/2909437.2909454.
  2. OpenCV API Reference. GPU-accelerated Computer Vision.
    https://docs.opencv.org/2.4.13.7/modules/gpu/doc/gpu.html . Cited January 5, 2022.
  3. L. B. Bosi, M. Mariotti, and A. Santocchia, “GPU Linear Algebra Extensions for GNU/Octave,” J. Phys. Conf. Ser. 368 (1) (2012),
    doi 10.1088/1742-6596/368/1/012062.
  4. FFmpeg Hardware Acceleration.
    https://trac.ffmpeg.org/wiki/HWAccelIntro . Cited January 5, 2022.
  5. M. Abadi, P. Barham, J. Chen, et al., TensorFlow: A System for Large-Scale Machine Learning. ArXiv preprint: 1605.08695 [cs.DC](Cornell Univ. Library, Ithaca, 2016).
    https://arxiv.org/abs/1605.08695 . Cited January 5, 2022.
  6. R. Vuduc, A. Chandramowlishwaran, J. Choi, et al., “On the Limits of GPU Acceleration,” in Proc. 2nd USENIX Conf. on Hot Topics in Parallelism, Berkeley, USA, June 14-15, 2010 (USENIX Association, Berkeley, 2010),
    doi 10.5555/1863086.1863099.
  7. CUDA C++ Programming Guide PG-02829-001_v11.5.
    https://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf . Cited January 5, 2022.
  8. The OpenCL Specification. Khronos OpenCL Working Group. Version V3.0.10, 19 Nov 2021.
    https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html . Cited January 5, 2022.
  9. A. A. Kleymenov and N. N. Popova, “A Method for Prediction Dynamic Characteristics of Parallel Programs Based on Static Analysis,” Vestn. Yuzhn. Ural. Gos. Univ. Ser. Vychisl. Mat. Inf., 10 (1), 20-31 (2021).
    doi 10.14529/cmse210102.
  10. A. A. Kleimenov and N. N. Popova, “A Method for Prediction Execution Time of GPU Programs,” Comp. Nanotechnol. 8 (1), 38-45 (2021).
    doi 10.33693/2313-223X-2021-8-1-38-45.
  11. K. Kothapalli, R. Mukherjee, M. S. Rehman, et al., “A Performance Prediction Model for the CUDA GPGPU Platform,” in Proc. 2009 Int. Conf. on High Performance Computing, Kochi, India, December 16-19, 2009 (IEEE Press, New York, 2009), pp. 463-472,
    doi 10.1109/HIPC.2009.5433179.
  12. L. G. Valiant, “A Bridging Model for Parallel Computation,” Commun. ACM 33 (8), 103-111 (1990).
    doi 10.1145/79173.79181.
  13. S. Fortune and J. Wyllie, “Parallelism in Random Access Machines,” in Proc. 10th ACM Symposium on Theory of Computing, San Diego, USA, May 1-3, 1978 (ACM Press, New York, 1978), pp. 114-118.
    doi 10.1145/800133.804339.
  14. P. B. Gibbons, Y. Matias, and V. Ramachandran, “The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms,” SIAM J. Comput. 28 (2), 733-769 (1998).
  15. S. S. Baghsorkhi, M. Delahaye, S. J. Patel, et al., “An Adaptive Performance Modeling Tool for GPU Architectures,” ACM SIGPLAN Not. 45 (5), 105-114 (2010).
    doi 10.1145/1837853.1693470.
  16. S. Hong and H. Kim, “An Analytical Model for a GPU Architecture with Memory-Level and Thread-Level Parallelism Awareness,” ACM SIGARCH Comput. Archit. News 37 (3), 152-163 (2009).
    doi 10.1145/1555754.1555775.
  17. Y. Zhang and J. D. Owens, “A Quantitative Performance Analysis Model for GPU Architectures,” in Proc. IEEE 17th Int. Symposium on High Performance Computer Architecture, San Antonio, USA, February 12-16, 2011 (IEEE Press, New York, 2011), pp. 382-393,
    doi 10.1109/HPCA.2011.5749745.
  18. J. Lai and A. Seznec, “Break Down GPU Execution Time with an Analytical Method,” in Proc. 2012 Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools, Paris, France, January 23, 2012 (ACM Press, New York, 2012), pp. 33-39,
    doi 10.1145/2162131.2162136.
  19. J. Sim, A. Dasgupta, H. Kim, and R. Vuduc, “A Performance Analysis Framework for Identifying Potential Benefits in GPGPU Applications,” ACM SIGPLAN Not. 47 (8), 11-22 (2012).
    doi 10.1145/2145816.2145819.
  20. J.-C.  Huang, J. H. Lee, H. Kim, and H.-H. S. Lee, “GPUMech: GPU Performance Modeling Technique Based on Interval Analysis,” in Proc. 47th Annual IEEE/ACM Int. Symposium on Microarchitecture, Cambridge, United Kingdom, December 13-17, 2014 (IEEE Press, Washington, DC, 2014), pp. 268-279,
    doi 10.1109/MICRO.2014.59.
  21. M. Amaris, D. Cordeiro, A. Goldman, and R. Y. De Camargo, “A Simple BSP-based Model to Predict Execution Time in GPU Applications,” in Proc. IEEE 22nd Int. Conf. on High Performance Computing, Bengaluru, India, December 16-19, 2015 (IEEE Press, Washington, DC, 2015), pp. 285-294,
    doi 10.1109/HiPC.2015.34.
  22. T. C. Carroll and P. W. H. Wong, “An Improved Abstract GPU Model with Data Transfer,” in Proc. 46th Int. Conf. on Parallel Processing Workshops, Bristol, United Kingdom, August 14-17, 2017 (IEEE Press, New York, 2017), pp. 113-120,
    doi 10.1109/ICPPW.2017.28.
  23. G. Alavani, K. Varma, and S. Sarkar, “Predicting Execution Time of CUDA Kernel Using Static Analysis,” in IEEE Int. Conf. on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom), Melbourne, Australia, December 11-13, 2018 (IEEE Press, New York, 2018), pp. 948-955,
    doi 10.1109/BDCloud.2018.00139.
  24. Q. Wang and X. Chu, GPGPU Performance Estimation with Core and Memory Frequency Scaling , ArXiv preprint: 1701.05308v2 [cs.PF] (Cornell Univ. Library, Ithaca, 2018).
    https://arxiv.org/abs/1701.05308 . Cited January 6, 2022.
  25. S. Salaria, A. Drozd, A. Podobas, and S. Matsuoka, “Learning Neural Representations for Predicting GPU Performance,” in Lecture Notes in Computer Science (Springer, Cham, 2019), Vol. 11501, pp. 40-58.
    doi 10.1007/978-3-030-20656-7_3.
  26. L. Braun, S. Nikas, C. Song., et al., A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels ArXiv preprint: 2001.07104v3 [cs.DC] (Cornell Univ. Library, Ithaca, 2020).
    https://arxiv.org/abs/2001.07104 . Cited January 6, 2022.
  27. T. T. Dao, J. Kim, S. Seo, et al., “A Performance Model for GPUs with Caches,” IEEE Trans. Parallel Distrib. Syst. 26 (7), 1800-1813 (2015).
    doi 10.1109/TPDS.2014.2333526.
  28. A. Karami, F. Khunjush, and S. A. Mirsoleimani, “A Statistical Performance Analyzer Framework for OpenCL Kernels on Nvidia GPUs,” J. Supercomput. 71 (8), 2900-2921 (2015).
    doi 10.1007/s11227-014-1338-z.
  29. P. Memarzia and F. Khunjush, “An In-depth Study on the Performance Impact of CUDA, OpenCL, and PTX Code,” J. Inf. Comput. Sci. 10 (2), 124-136 (2015).
  30. Z. Wang, B. He, W. Zhang, and S. Jiang, “A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs,” in Proc. 2016 IEEE Int. Symposium on High Performance ComputerArchitecture, Barcelona, Spain, March 12-16, 2016 (IEEE Press, New York, 2016), pp. 114-125,
    doi 10.1109/HPCA.2016.7446058.
  31. X. Wang, K. Huang, A. Knoll, and X. Qian, “A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation,” in Proc. IEEE Int. Symposium on High Performance Computer Architecture, Washington, DC, USA, February 16-20, 2019 (IEEE Press, New York, 2019), pp. 506-518,
    doi 10.1109/HPCA.2019.00062.
  32. B. Johnston, G. Falzon, and J. Milthorpe, “OpenCL Performance Prediction Using Architecture-Independent Features,” in Proc. Int. Workshop on High Performance Computing & Simulation Orleans, France, July 16-20, 2018 (IEEE Press, New York, 2018), pp. 561-569,
    doi 10.1109/HPCS.2018.00095.
  33. J. Price, “An OpenCL Device Simulator and Debugger,”
    https://github.com/jrprice/Oclgrind . Cited January 8, 2022.
  34. H. Wong, M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, “Demystifying GPU Microarchitecture through Microbenchmarking,” in Proc. IEEE Int. Symposium on Performance Analysis of Systems & Software, White Plains, USA, March 28-30, 2010 (IEEE Press, New York, 2010), pp. 235-246,
    doi 10.1109/ISPASS.2010.5452013.
  35. V. V. Voevodin and Vl. V. Voevodin, Parallel Computing (BHV-Petersburg, St. Petersburg, 2002) [in Russian].
  36. R. Cuninghame-Green, Minimax Algebra (Springer, Berlin, 1979).
    doi 10.1007/978-3-642-48708-8.
  37. N. K. Krivulin, Methods of Idempotent Algebra for Problems in Modeling and Analysis of Complex Systems (St. Petersburg Univ. Press, St. Petersburg, 2009) [in Russian].
  38. E. E. Tyrtyshnikov, Fundamentals of Algebra (Fizmatlit, Moscow, 2017) [in Russian].