Related papers: Distributed Inference Performance Optimization for…

Inference Performance Optimization for Large Language Models on CPUs

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science · 2024-07-11 · Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Efficient LLM Inference on CPUs

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which…

Machine Learning · Computer Science · 2023-12-08 · Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng

Inference Acceleration for Large Language Models on CPUs

In recent years, large language models have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, deploying these models for real-world applications often requires efficient inference solutions…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-13 · Ditto PS, Jithin VG, Adarsh MS

Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models

Large language models have demonstrated extraordinary performance in many AI tasks but are expensive to use, even after training, due to their requirement of high-end GPUs. Recently, a distributed system called PETALS was developed to lower…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-30 · Tingyang Sun, Ting He, Bo Ji, Parimal Parag

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them…

Machine Learning · Computer Science · 2023-12-14 · Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science · 2024-11-26 · Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices…

Computation and Language · Computer Science · 2024-08-01 · Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar

Extending Token Computation for LLM Reasoning

Large Language Models (LLMs) are pivotal in advancing natural language processing but often struggle with complex reasoning tasks due to inefficient attention distributions. In this paper, we explore the effect of increased computed tokens…

Computation and Language · Computer Science · 2024-06-25 · Bingli Liao, Danilo Vasconcellos Vargas

Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference

Aligning future system design with the ever-increasing compute needs of large language models (LLMs) is undoubtedly an important problem in today's world. Here, we propose a general performance modeling methodology and workload analysis of…

Hardware Architecture · Computer Science · 2024-07-23 · Joyjit Kundu, Wenzhe Guo, Ali BanaGozar, Udari De Alwis, Sourav Sengupta, Puneet Gupta, Arindam Mallik

SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization

Large language models (LLMs) have been a disruptive innovation in recent years, and they play a crucial role in our daily lives due to their ability to understand and generate human-like text. Their capabilities include natural language…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-10-17 · Akrit Mudvari, Yuang Jiang, Leandros Tassiulas

Efficient LLM inference solution on Intel GPU

Transformer-based Large Language Models (LLMs) have been widely used in many fields, and the efficiency of LLM inference has become a hot topic in real applications. However, LLMs are usually complicatedly designed in model structure with…

Hardware Architecture · Computer Science · 2024-06-25 · Hui Wu, Yi Gan, Feng Yuan, Jing Ma, Wei Zhu, Yutao Xu, Hong Zhu, Yuhua Zhu, Xiaoli Liu, Jinghui Gu, Peng Zhao

HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices

In recent times, the emergence of Large Language Models (LLMs) has resulted in increasingly large model sizes, posing challenges for inference on low-resource devices. Prior approaches have explored offloading to facilitate low-memory…

Performance · Computer Science · 2024-03-05 · Xuanlei Zhao, Bin Jia, Haotian Zhou, Ziming Liu, Shenggan Cheng, Yang You

Highly Optimized Kernels and Fine-Grained Codebooks for LLM Inference on Arm CPUs

Large language models (LLMs) have transformed the way we think about language understanding and generation, enthralling both researchers and developers. However, deploying LLMs for inference has been a significant challenge due to their…

Machine Learning · Computer Science · 2025-01-03 · Dibakar Gope, David Mansell, Danny Loh, Ian Bratt

Confidential LLM Inference: Performance and Cost Across CPU and GPU TEEs

Large Language Models (LLMs) are increasingly deployed on converged Cloud and High-Performance Computing (HPC) infrastructure. However, as LLMs handle confidential inputs and are fine-tuned on costly, proprietary datasets, their heightened…

Performance · Computer Science · 2025-09-24 · Marcin Chrapek, Marcin Copik, Etienne Mettaz, Torsten Hoefler

LIMINAL: Exploring The Frontiers of LLM Decode Performance

The rapid advancement of Large Language Models (LLMs) necessitates a deep understanding of their fundamental performance limits. This paper investigates the limits of LLM inference, focusing on hardware-imposed bottlenecks in…

Hardware Architecture · Computer Science · 2025-11-17 · Michael Davies, Neal Crago, Karthikeyan Sankaralingam, Christos Kozyrakis

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science · 2024-01-04 · Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models

Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity has resulted in numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and…

Performance · Computer Science · 2023-12-04 · Longteng Zhang, Xiang Liu, Zeyu Li, Xinglin Pan, Peijie Dong, Ruibo Fan, Rui Guo, Xin Wang, Qiong Luo, Shaohuai Shi, Xiaowen Chu

LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators

Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. However, the computational demands of these complex models pose significant challenges,…

Machine Learning · Computer Science · 2024-11-04 · Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus, Aditya Tanikanti, Ken Raffenetti, Valerie Taylor, Murali Emani, Venkatram Vishwanath

TokenSim: Enabling Hardware and Software Exploration for Large Language Model Inference Systems

The increasing demand for large language model (LLM) serving has necessitated significant advancements in the optimization and profiling of LLM inference systems. As these models become integral to a wide range of applications, the need for…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-20 · Feiyang Wu, Zhuohang Bian, Guoyang Duan, Tianle Xu, Junchi Wu, Teng Ma, Yongqiang Yao, Ruihao Gong, Youwei Zhuo

LLMem: Estimating GPU Memory Usage for Fine-Tuning Pre-Trained LLMs

Fine-tuning pre-trained large language models (LLMs) with limited hardware presents challenges due to GPU memory constraints. Various distributed fine-tuning methods have been proposed to alleviate memory constraints on GPU. However,…

Artificial Intelligence · Computer Science · 2024-04-18 · Taeho Kim, Yanming Wang, Vatshank Chaturvedi, Lokesh Gupta, Seyeon Kim, Yongin Kwon, Sangtae Ha