Abstract:
The training of Large Language Models (LLMs) is time-consuming and expensive, and its efficiency is often affected by both the heterogeneous computing devices and the heterogeneous communication networks in the computing cluster. In recent years, new computing devices and technologies such as the NVIDIA H200 and Compute Express Link (CXL) 3.0 have emerged, bringing new opportunities for improving LLM training efficiency. However, the deployment difficulty and cost of these new devices and technologies are extremely high, making it difficult for researchers to evaluate their impact on LLM training. To address this problem, this paper introduces a simulation tool named HeterSim and proposes using it to simulate and evaluate LLM training in CXL-based heterogeneous computing clusters. Taking the LLM LLaMA as a simulation example, we successfully simulate and analyze the impact of heterogeneous computing and CXL technologies on LLM training. We hope this article provides researchers with new ideas for simulating and analyzing LLM training, and helps them explore the impact of emerging technologies on LLM training at low cost. © 2024 IEEE.
Year: 2024
Language: English