Authors:

Guo, Zhenhua | Tang, Yinan | Zhai, Jidong | Yuan, Tongtong | Jin, Jian | Wang, Li | Zhao, Yaqian | Li, Rengang

Indexed by:

EI, Scopus, SCIE

Abstract:

Recent breakthroughs in large-scale DNNs have drawn significant attention from both academia and industry to distributed DNN training techniques. Because large-scale distributed DNN training is time-consuming and expensive to run, it is crucial to model and predict its performance before actual deployment, so that the training design can be optimized at low cost. This paper analyzes and emphasizes the importance of modeling and predicting the performance of distributed DNN training, categorizes and analyzes the related state-of-the-art work, and discusses future challenges and opportunities for this research field. The objectives of this paper are twofold: first, to help researchers understand and choose suitable modeling and prediction tools for large-scale distributed DNN training, and second, to encourage further valuable research on performance modeling and prediction for distributed DNN training.

Keywords:

Training; Predictive models; Tensors; Servers; Pipelines; Computational modeling; distributed computing systems; Software; performance modeling; Distributed DNN training; Costs; Optimization; large-scale machine learning; Parallel processing

Author Affiliations:

  • [ 1 ] [Guo, Zhenhua]IEIT Syst Co Ltd, Jinan 250101, Peoples R China
  • [ 2 ] [Tang, Yinan]IEIT Syst Co Ltd, Jinan 250101, Peoples R China
  • [ 3 ] [Wang, Li]IEIT Syst Co Ltd, Jinan 250101, Peoples R China
  • [ 4 ] [Zhao, Yaqian]IEIT Syst Co Ltd, Jinan 250101, Peoples R China
  • [ 5 ] [Li, Rengang]IEIT Syst Co Ltd, Jinan 250101, Peoples R China
  • [ 6 ] [Zhai, Jidong]Tsinghua Univ, Beijing 100084, Peoples R China
  • [ 7 ] [Yuan, Tongtong]Beijing Univ Technol, Beijing 100124, Peoples R China
  • [ 8 ] [Jin, Jian]CAICT, Beijing 100191, Peoples R China

Reprint Author's Address:

  • [Tang, Yinan]IEIT Syst Co Ltd, Jinan 250101, Peoples R China

Source:

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

ISSN: 1045-9219

Year: 2024

Issue: 12

Volume: 35

Page: 2463-2478

5.300 (JCR@2022)
