Abstract:
The training of Large Language Models (LLMs) is time-consuming and expensive, and its efficiency is often affected by both the heterogeneous computing devices and the heterogeneous communication networks in the computing cluster. In recent years, new computing devices and technologies such as the NVIDIA H200 and Compute Express Link (CXL) 3.0 have emerged, bringing new opportunities for improving LLM training efficiency. However, the deployment difficulty and cost of these new devices and technologies are extremely high, making it difficult for researchers to evaluate their impact on LLM training. To address this problem, this paper introduces a simulation tool named HeterSim and proposes using it to simulate and evaluate LLM training in CXL-based heterogeneous computing clusters. Taking the LLM LLaMA as a simulation example, we successfully simulate and analyze the impact of heterogeneous computing and CXL technologies on LLM training. We hope this article provides researchers with new ideas for simulating and analyzing LLM training, and helps them explore the impact of emerging technologies on LLM training at low cost. © 2024 IEEE.
Year: 2024
Language: English