A Survey on Performance Modeling and Prediction for Distributed DNN Training

Indexed by： EI, Scopus, SCIE

Abstract：

Recent breakthroughs in large-scale DNNs have attracted significant attention from both academia and industry toward distributed DNN training techniques. Because large-scale distributed DNN training is time-consuming and expensive to run, it is crucial to model and predict its performance before actual deployment, so that the training design can be optimized at low cost. This paper analyzes and emphasizes the importance of modeling and predicting the performance of distributed DNN training, categorizes and analyzes related state-of-the-art work, and discusses future challenges and opportunities for this research field. The objectives of this paper are twofold: first, to assist researchers in understanding and choosing suitable modeling and prediction tools for large-scale distributed DNN training, and second, to encourage further research on performance modeling and prediction for distributed DNN training.

Keyword：

Training; Predictive models; Tensors; Servers; Pipelines; Computational modeling; distributed computing systems; Software performance modeling; Distributed DNN training; Costs; Optimization; large-scale machine learning; Parallel processing

Author Community：

[1] [Guo, Zhenhua] IEIT Syst Co Ltd, Jinan 250101, Peoples R China
[2] [Tang, Yinan] IEIT Syst Co Ltd, Jinan 250101, Peoples R China
[3] [Wang, Li] IEIT Syst Co Ltd, Jinan 250101, Peoples R China
[4] [Zhao, Yaqian] IEIT Syst Co Ltd, Jinan 250101, Peoples R China
[5] [Li, Rengang] IEIT Syst Co Ltd, Jinan 250101, Peoples R China
[6] [Zhai, Jidong] Tsinghua Univ, Beijing 100084, Peoples R China
[7] [Yuan, Tongtong] Beijing Univ Technol, Beijing 100124, Peoples R China
[8] [Jin, Jian] CAICT, Beijing 100191, Peoples R China

Reprint Author's Address：

[Tang, Yinan] IEIT Syst Co Ltd, Jinan 250101, Peoples R China

Source ：

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

ISSN： 1045-9219

Year： 2024

Issue： 12

Volume： 35

Page： 2463-2478

Impact Factor： 5.300 (JCR@2022)

ESI Highly Cited Papers on the List： 0
