Distributed Inference Performance Optimization for LLMs on CPUs

Pujiang He; Shan Zhou; Changqing Li; Wenhuan Huang; Weifei Yu; Duyi Wang; Chen Meng; Sheng Gui

Distributed, Parallel, and Cluster Computing 2024-07-02 v1

Abstract

Large language models (LLMs) hold tremendous potential for addressing numerous real-world challenges, yet they typically demand significant computational resources and memory. Deploying LLMs on resource-limited hardware with restricted memory capacity presents considerable challenges. Distributed computing is a prevalent strategy for mitigating single-node memory constraints and accelerating LLM inference. To reduce the burden of these hardware limitations, we propose an efficient distributed inference optimization solution for LLMs on CPUs. We conduct experiments with the proposed solution on 5th Gen Intel Xeon Scalable Processors, and the results show that the time per output token for an LLM with 72B parameters is 140 ms/token, much faster than the average human reading speed of about 200 ms per token.
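
As a rough illustration of the figures quoted in the abstract, the sketch below converts time per output token into tokens per second and estimates the memory needed just to hold the weights of a 72B-parameter model. The helper names are hypothetical, and the bytes-per-parameter values are assumptions for illustration only; the abstract does not state which numeric precision was used.

```python
# Back-of-the-envelope arithmetic for the numbers quoted in the abstract.
# Helper names are illustrative; the precisions below are assumptions.

def tokens_per_second(ms_per_token: float) -> float:
    """Convert a latency in milliseconds per token to tokens per second."""
    return 1000.0 / ms_per_token


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes).

    Ignores the key-value cache, activations, and runtime overhead.
    """
    return num_params * bytes_per_param / 1e9


reported_ms = 140.0  # time per output token reported in the abstract
reading_ms = 200.0   # approximate human reading speed cited in the abstract

print(f"reported: {tokens_per_second(reported_ms):.2f} tokens/s")
print(f"reading:  {tokens_per_second(reading_ms):.2f} tokens/s")
print(f"ratio:    {reading_ms / reported_ms:.2f}x")

# Assumed precisions (not stated in the abstract): 2 bytes for a 16-bit
# format, 1 byte for an 8-bit format.
for label, nbytes in [("16-bit", 2), ("8-bit", 1)]:
    print(f"72B weights at {label}: {weight_memory_gb(72e9, nbytes):.0f} GB")
```

Under these assumptions, the weights of a 72B-parameter model alone would occupy on the order of a hundred gigabytes or more, which is consistent with the abstract's motivation for spreading inference across several nodes.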

Keywords

large language model inference; key-value cache; large language model training

Cite

@article{arxiv.2407.00029,
  title  = {Distributed Inference Performance Optimization for LLMs on CPUs},
  author = {Pujiang He and Shan Zhou and Changqing Li and Wenhuan Huang and Weifei Yu and Duyi Wang and Chen Meng and Sheng Gui},
  journal= {arXiv preprint arXiv:2407.00029},
  year   = {2024}
}

Comments

4 pages, 3 figures, Practical ML for Low Resource Settings Workshop @ ICLR 2024
