Abstract:
Distributed deep learning systems are the engine of large-scale deep learning model training. As both the volume of training datasets and the complexity of training models grow, the resource cost of deep learning model training increases significantly and becomes a new concern for distributed deep learning systems. In these systems, resource allocation refers to the number of computing nodes assigned to a parallel model training job, and batch sizing determines the amount of training data processed by a single training task. Empirical studies demonstrate that, from the perspective of resource cost optimization, there is a complex interdependence between the configurations of resource allocation and batch sizing. However, existing works ignore this interdependence and treat the two configuration methods as independent means of optimizing the accuracy and the computational efficiency of distributed deep learning training, respectively; they therefore struggle to meet both goals of maximizing training accuracy under a training time constraint and minimizing resource cost. To address this issue, this paper proposes a collaborative configuration method for resource allocation and batch sizing in distributed deep learning systems. Here, the resource cost is defined as the product of the resource allocation and the training time. The proposed method is based on the observation that both the relationship between resource allocation and training time and the relationship between batch sizing and training accuracy are monotonic. In the proposed method, a training accuracy prediction model and a training time prediction model are first established with the isotonic regression technique. The training time prediction model is a function of the resource allocation and the batch sizing that predicts the elapsed time of one training epoch. The training accuracy prediction model is a function of the batch sizing and the total number of training epochs that predicts the convergent accuracy of model training. Then, given a training time budget, the total number of training epochs under different configurations of resource allocation and batch sizing can be calculated with the training time model, and the corresponding convergent training accuracy can be predicted with the training accuracy prediction model. Finally, based on these predictions, optimized configurations of resource allocation and batch sizing that satisfy the requirements on training accuracy, time constraint, and resource cost are found with a Tabu search heuristic. We implement the proposed method in TensorFlow and verify its efficiency with representative deep learning models and training datasets. Experimental results demonstrate that the mean relative errors of the training time prediction model and the training accuracy prediction model are at most 7.5% and 1.65%, respectively. Compared with independent configuration methods, the proposed collaborative method reduces the resource cost of deep learning model training by up to 26.89%. © 2022, Science Press. All rights reserved.
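To make the pipeline described in the abstract concrete, below is a minimal, illustrative sketch in Python. It is not the authors' implementation: it assumes epoch time is monotone decreasing in the node count for each fixed batch size and convergent accuracy is monotone increasing in the epoch count for each fixed batch size, fits each curve with 1-D isotonic regression (scikit-learn; the paper does not specify its exact model form), and runs a simple tabu search over a small discrete (nodes, batch size) grid. The profiling functions and all numbers are synthetic placeholders.

import numpy as np
from sklearn.isotonic import IsotonicRegression

NODES = [1, 2, 4, 8, 16]          # candidate resource allocations
BATCHES = [64, 128, 256, 512]     # candidate batch sizes

# Synthetic stand-ins for profiled measurements (placeholders, not real data).
def profile_epoch_time(r, b):     # epoch time shrinks as nodes increase
    return 100.0 * (b / 256) ** 0.1 / r ** 0.8

def profile_accuracy(b, epochs):  # accuracy rises with epochs, dips with batch size
    return 0.95 - 0.3 / (epochs + 1) - 0.00005 * b

# Fit one monotone (isotonic) predictor per batch size along each axis.
time_model = {b: IsotonicRegression(increasing=False, out_of_bounds="clip")
                  .fit(NODES, [profile_epoch_time(r, b) for r in NODES])
              for b in BATCHES}
epoch_grid = np.arange(1, 201)
acc_model = {b: IsotonicRegression(increasing=True, out_of_bounds="clip")
                 .fit(epoch_grid, [profile_accuracy(b, e) for e in epoch_grid])
             for b in BATCHES}

def evaluate(cfg, time_budget):
    """Predict epochs achievable within the budget, accuracy, and resource cost."""
    r, b = cfg
    t_epoch = float(time_model[b].predict([r])[0])
    epochs = int(time_budget / t_epoch)
    acc = float(acc_model[b].predict([epochs])[0]) if epochs > 0 else 0.0
    cost = r * epochs * t_epoch   # resource cost = allocation x training time
    return acc, cost

def tabu_search(time_budget, acc_target, iters=50, tabu_len=8):
    """Minimize resource cost subject to predicted accuracy >= acc_target."""
    current = (NODES[0], BATCHES[0])
    best, best_cost = None, float("inf")
    tabu = []
    for _ in range(iters):
        # Neighbors: configurations one grid step away that are not tabu.
        neighbors = [(r, b) for r in NODES for b in BATCHES
                     if abs(NODES.index(r) - NODES.index(current[0])) <= 1
                     and abs(BATCHES.index(b) - BATCHES.index(current[1])) <= 1
                     and (r, b) not in tabu]
        if not neighbors:
            break
        # Prefer feasible neighbors (meeting the accuracy target) with lowest cost.
        def key(cfg):
            acc, cost = evaluate(cfg, time_budget)
            return (acc < acc_target, cost)
        current = min(neighbors, key=key)
        acc, cost = evaluate(current, time_budget)
        if acc >= acc_target and cost < best_cost:
            best, best_cost = current, cost
        tabu.append(current)
        tabu = tabu[-tabu_len:]   # keep only the most recent moves tabu
    return best, best_cost

best_cfg, cost = tabu_search(time_budget=3600.0, acc_target=0.90)
print("best (nodes, batch):", best_cfg, "predicted cost:", round(cost, 1))

Running the sketch prefers small allocations here because the synthetic cost grows with the node count faster than the accuracy gain; with real profiled data, the trade-off between the two configuration knobs is what the collaborative search exploits.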
Source:
Chinese Journal of Computers
ISSN: 0254-4164
Year: 2022
Issue: 2
Volume: 45
Page: 302-316
Cited Count:
WoS CC Cited Count: 0
SCOPUS Cited Count: 2
ESI Highly Cited Papers on the List: 0