Abstract:
Distributed deep learning systems are the engine of large-scale deep learning model training. As both the volume of training datasets and the complexity of training models grow, the resource cost of deep learning model training increases significantly and becomes a new concern for distributed deep learning systems. In these systems, resource allocation refers to the number of computing nodes assigned to a parallel model training job, and batch sizing determines the amount of training data processed by a single training task. Empirical studies demonstrate that, from the perspective of resource cost optimization, there is a complex interdependence between the configurations of resource allocation and batch sizing. However, existing works ignore this interdependence and treat the two configuration methods as independent means of optimizing the accuracy and the computational efficiency of distributed deep learning training, respectively; they therefore struggle to meet both goals of maximizing training accuracy under a training time constraint and minimizing resource cost. To address this issue, this paper proposes a collaborative configuration method for resource allocation and batch sizing in distributed deep learning systems. Here, the resource cost is defined as the product of the resource allocation and the training time. The proposed method is based on the observation that both the relationship between resource allocation and training time and the relationship between batch sizing and training accuracy are monotonic. In the proposed method, a training accuracy prediction model and a training time prediction model are first established with the isotonic regression technique. The training time prediction model is a function of the resource allocation and the batch sizing that predicts the elapsed time of one training epoch. The training accuracy prediction model is a function of the batch sizing and the total number of training epochs that predicts the convergent accuracy of model training. Then, given a training time budget, the total number of training epochs under different configurations of resource allocation and batch sizing can be calculated with the training time model, and the corresponding convergent training accuracy can be predicted with the training accuracy prediction model. Finally, based on these predictions, optimized configurations of resource allocation and batch sizing that satisfy the requirements on training accuracy, time constraint, and resource cost are found with a Tabu search heuristic. We implement the proposed method in TensorFlow and verify its efficiency with representative deep learning models and training datasets. Experimental results demonstrate that the mean relative errors of the training time prediction model and the training accuracy prediction model are at most 7.5% and 1.65%, respectively. Compared with independent configuration methods, the proposed collaborative method reduces the resource cost of deep learning model training by up to 26.89%. © 2022, Science Press. All rights reserved.
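To make the pipeline described in the abstract concrete, below is a minimal, illustrative sketch in Python. It is not the authors' implementation: it assumes epoch time is monotone decreasing in the node count for each fixed batch size and convergent accuracy is monotone increasing in the epoch count for each fixed batch size, fits each curve with 1-D isotonic regression (scikit-learn; the paper does not specify its exact model form), and runs a simple tabu search over a small discrete (nodes, batch size) grid. The profiling functions and all numbers are synthetic placeholders.

import numpy as np
from sklearn.isotonic import IsotonicRegression

NODES = [1, 2, 4, 8, 16]          # candidate resource allocations
BATCHES = [64, 128, 256, 512]     # candidate batch sizes

# Synthetic stand-ins for profiled measurements (placeholders, not real data).
def profile_epoch_time(r, b):     # epoch time shrinks as nodes increase
    return 100.0 * (b / 256) ** 0.1 / r ** 0.8

def profile_accuracy(b, epochs):  # accuracy rises with epochs, dips with batch size
    return 0.95 - 0.3 / (epochs + 1) - 0.00005 * b

# Fit one monotone (isotonic) predictor per batch size along each axis.
time_model = {b: IsotonicRegression(increasing=False, out_of_bounds="clip")
                  .fit(NODES, [profile_epoch_time(r, b) for r in NODES])
              for b in BATCHES}
epoch_grid = np.arange(1, 201)
acc_model = {b: IsotonicRegression(increasing=True, out_of_bounds="clip")
                 .fit(epoch_grid, [profile_accuracy(b, e) for e in epoch_grid])
             for b in BATCHES}

def evaluate(cfg, time_budget):
    """Predict epochs achievable within the budget, accuracy, and resource cost."""
    r, b = cfg
    t_epoch = float(time_model[b].predict([r])[0])
    epochs = int(time_budget / t_epoch)
    acc = float(acc_model[b].predict([epochs])[0]) if epochs > 0 else 0.0
    cost = r * epochs * t_epoch   # resource cost = allocation x training time
    return acc, cost

def tabu_search(time_budget, acc_target, iters=50, tabu_len=8):
    """Minimize resource cost subject to predicted accuracy >= acc_target."""
    current = (NODES[0], BATCHES[0])
    best, best_cost = None, float("inf")
    tabu = []
    for _ in range(iters):
        # Neighbors: configurations one grid step away that are not tabu.
        neighbors = [(r, b) for r in NODES for b in BATCHES
                     if abs(NODES.index(r) - NODES.index(current[0])) <= 1
                     and abs(BATCHES.index(b) - BATCHES.index(current[1])) <= 1
                     and (r, b) not in tabu]
        if not neighbors:
            break
        # Prefer feasible neighbors (meeting the accuracy target) with lowest cost.
        def key(cfg):
            acc, cost = evaluate(cfg, time_budget)
            return (acc < acc_target, cost)
        current = min(neighbors, key=key)
        acc, cost = evaluate(current, time_budget)
        if acc >= acc_target and cost < best_cost:
            best, best_cost = current, cost
        tabu.append(current)
        tabu = tabu[-tabu_len:]   # keep only the most recent moves tabu
    return best, best_cost

best_cfg, cost = tabu_search(time_budget=3600.0, acc_target=0.90)
print("best (nodes, batch):", best_cfg, "predicted cost:", round(cost, 1))

Running the sketch prefers small allocations here because the synthetic cost grows with the node count faster than the accuracy gain; with real profiled data, the trade-off between the two configuration knobs is what the collaborative search exploits.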
Source:
Chinese Journal of Computers
ISSN: 0254-4164
Year: 2022
Issue: 2
Volume: 45
Page: 302-316
Cited Count:
WoS CC Cited Count: 0
SCOPUS Cited Count: 2
ESI Highly Cited Papers on the List: 0