Related papers: Efficient LLM inference solution on Intel GPU

Inference Performance Optimization for Large Language Models on CPUs

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science 2024-07-11 Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

The Impact of Hyperparameters on Large Language Model Inference Performance: An Evaluation of vLLM and HuggingFace Pipelines

The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model…

Software Engineering · Computer Science 2024-08-05 Matias Martinez

Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching

Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method…

Machine Learning · Computer Science 2025-11-11 Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Jiesheng Wu, Feng Lyu

Efficient LLM Inference on CPUs

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which…

Machine Learning · Computer Science 2023-12-08 Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng

FlashDecoding++: Faster Large Language Model Inference on GPUs

The Large Language Model (LLM) is becoming increasingly important in various domains. However, the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation…

Machine Learning · Computer Science 2024-01-08 Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-23 Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze

Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies

LLMs are increasingly used world-wide from daily tasks to agentic systems and data analytics, requiring significant GPU resources. LLM inference systems, however, are slow compared to database systems, and inference performance and…

Performance · Computer Science 2025-10-03 Kyoungmin Kim, Jiacheng Li, Kijae Hong, Anastasia Ailamaki

Efficient LLM Inference with Kcache

Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures…

Computation and Language · Computer Science 2024-04-30 Qiaozhi He, Zhihua Wu

Efficient LLM Inference with Activation Checkpointing and Hybrid Caching

Recent large language models (LLMs) with enormous model sizes use many GPUs to meet memory capacity requirements incurring substantial costs for token generation. To provide cost-effective LLM inference with relaxed latency constraints,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-03 Sanghyeon Lee, Hongbeen Kim, Soojin Hwang, Guseul Heo, Minwoo Noh, Jaehyuk Huh

Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design

The efficiency of Large Language Model (LLM) inference is often constrained by substantial memory bandwidth and capacity demands. Existing techniques, such as pruning, quantization, and mixture of experts/depth, reduce memory capacity…

Hardware Architecture · Computer Science 2025-04-23 Rui Xie, Asad Ul Haq, Linsen Ma, Yunhua Fang, Zirak Burzin Engineer, Liu Liu, Tong Zhang

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Task Scheduling for Efficient Inference of Large Language Models on Single Moderate GPU Systems

Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science 2024-11-26 Wenxiang Lin, Xinglin Pan, Shaohuai Shi, Xuan Wang, Xiaowen Chu

Characterizing LLM Inference Energy-Performance Tradeoffs across Workloads and GPU Scaling

LLM inference exhibits substantial variability across queries and execution phases, yet inference configurations are often applied uniformly. We present a measurement-driven characterization of workload heterogeneity and energy-performance…

Machine Learning · Computer Science 2026-02-25 Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic

LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices…

Computation and Language · Computer Science 2024-08-01 Keivan Alizadeh, Iman Mirzadeh, Dmitry Belenko, Karen Khatamifard, Minsik Cho, Carlo C Del Mundo, Mohammad Rastegari, Mehrdad Farajtabar

PIM Is All You Need: A CXL-Enabled GPU-Free System for Large Language Model Inference

Large Language Model (LLM) inference uses an autoregressive manner to generate one token at a time, which exhibits notably lower operational intensity compared to earlier Machine Learning (ML) models such as encoder-only transformers and…

Hardware Architecture · Computer Science 2025-05-06 Yufeng Gu, Alireza Khadem, Sumanth Umesh, Ning Liang, Xavier Servot, Onur Mutlu, Ravi Iyer, Reetuparna Das

Splitwiser: Efficient LM inference with constrained resources

Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail…

Hardware Architecture · Computer Science 2025-05-08 Asad Aali, Adney Cardoza, Melissa Capo

FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and…

Hardware Architecture · Computer Science 2024-01-10 Shulin Zeng, Jun Liu, Guohao Dai, Xinhao Yang, Tianyu Fu, Hongyi Wang, Wenheng Ma, Hanbo Sun, Shiyao Li, Zixiao Huang, Yadong Dai, Jintao Li, Zehao Wang, Ruoyu Zhang, Kairui Wen, Xuefei Ning, Yu Wang

XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference

Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache…

Machine Learning · Computer Science 2024-12-10 Weizhuo Li, Zhigang Wang, Yu Gu, Ge Yu

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines

Cost of serving large language models (LLM) is high, but the expensive and scarce GPUs are poorly efficient when generating tokens sequentially, unless the batch of sequences is enlarged. However, the batch size is limited by some…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-19 Jiaao He, Jidong Zhai