Related papers: Efficient LLM inference solution on Intel GPU

200 papers

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science 2024-07-11 Pujiang He , Shan Zhou , Wenhuan Huang , Changqing Li , Duyi Wang , Bin Guo , Chen Meng , Sheng Gui , Weifei Yu , Yi Xie

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao , Borui Wan , Yanghua Peng , Haibin Lin , Chuan Wu

The recent surge of open-source large language models (LLMs) enables developers to create AI-based solutions while maintaining control over aspects such as privacy and compliance, thereby providing governance and ownership of the model…

Software Engineering · Computer Science 2024-08-05 Matias Martinez

Large Language Models (LLMs) exhibit pronounced memory-bound characteristics during inference due to High Bandwidth Memory (HBM) bandwidth constraints. In this paper, we propose an L2 Cache-oriented asynchronous KV Cache prefetching method…

Machine Learning · Computer Science 2025-11-11 Yanhao Dong , Yubo Miao , Weinan Li , Xiao Zheng , Chao Wang , Jiesheng Wu , Feng Lyu

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which…

Machine Learning · Computer Science 2023-12-08 Haihao Shen , Hanwen Chang , Bo Dong , Yu Luo , Hengyu Meng

The Large Language Model (LLM) is becoming increasingly important in various domains. However, the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation…

Machine Learning · Computer Science 2024-01-08 Ke Hong , Guohao Dai , Jiaming Xu , Qiuli Mao , Xiuhong Li , Jun Liu , Kangdi Chen , Yuhan Dong , Yu Wang

Transformers, driven by attention mechanisms, form the foundation of large language models (LLMs). As these models scale up, efficient GPU attention kernels become essential for high-throughput and low-latency inference. Diverse LLM…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-04-23 Zihao Ye , Lequn Chen , Ruihang Lai , Wuwei Lin , Yineng Zhang , Stephanie Wang , Tianqi Chen , Baris Kasikci , Vinod Grover , Arvind Krishnamurthy , Luis Ceze

LLMs are increasingly used world-wide from daily tasks to agentic systems and data analytics, requiring significant GPU resources. LLM inference systems, however, are slow compared to database systems, and inference performance and…

Performance · Computer Science 2025-10-03 Kyoungmin Kim , Jiacheng Li , Kijae Hong , Anastasia Ailamaki

Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures…

Computation and Language · Computer Science 2024-04-30 Qiaozhi He , Zhihua Wu

Recent large language models (LLMs) with enormous model sizes use many GPUs to meet memory capacity requirements, incurring substantial costs for token generation. To provide cost-effective LLM inference with relaxed latency constraints,…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-03 Sanghyeon Lee , Hongbeen Kim , Soojin Hwang , Guseul Heo , Minwoo Noh , Jaehyuk Huh

The efficiency of Large Language Model (LLM) inference is often constrained by substantial memory bandwidth and capacity demands. Existing techniques, such as pruning, quantization, and mixture of experts/depth, reduce memory capacity…

Hardware Architecture · Computer Science 2025-04-23 Rui Xie , Asad Ul Haq , Linsen Ma , Yunhua Fang , Zirak Burzin Engineer , Liu Liu , Tong Zhang

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma , Chao Fang , Haikuo Shao , Zhongfeng Wang

Large language models (LLMs) are known for their high demand on computing resources and memory due to their substantial model size, which leads to inefficient inference on moderate GPU systems. Techniques like quantization or pruning can…

Computational Engineering, Finance, and Science · Computer Science 2024-11-26 Wenxiang Lin , Xinglin Pan , Shaohuai Shi , Xuan Wang , Xiaowen Chu

LLM inference exhibits substantial variability across queries and execution phases, yet inference configurations are often applied uniformly. We present a measurement-driven characterization of workload heterogeneity and energy-performance…

Machine Learning · Computer Science 2026-02-25 Paul Joe Maliakel , Shashikant Ilager , Ivona Brandic

Large language models (LLMs) are central to modern natural language processing, delivering exceptional performance in various tasks. However, their substantial computational and memory requirements present challenges, especially for devices…

Large Language Model (LLM) inference uses an autoregressive manner to generate one token at a time, which exhibits notably lower operational intensity compared to earlier Machine Learning (ML) models such as encoder-only transformers and…

Hardware Architecture · Computer Science 2025-05-06 Yufeng Gu , Alireza Khadem , Sumanth Umesh , Ning Liang , Xavier Servot , Onur Mutlu , Ravi Iyer , Reetuparna Das

Efficient inference of LLMs remains a crucial challenge, with two main phases: a compute-intensive prompt computation and a memory-intensive token generation. Despite existing batching and scheduling techniques, token generation phases fail…

Hardware Architecture · Computer Science 2025-05-08 Asad Aali , Adney Cardoza , Melissa Capo

Transformer-based Large Language Models (LLMs) have made a significant impact on various domains. However, LLMs' efficiency suffers from both heavy computation and memory overheads. Compression techniques like sparsification and…

Recently the generative Large Language Model (LLM) has achieved remarkable success in numerous applications. Notably its inference generates output tokens one-by-one, leading to many redundant computations. The widely-used KV-Cache…

Machine Learning · Computer Science 2024-12-10 Weizhuo Li , Zhigang Wang , Yu Gu , Ge Yu

The cost of serving large language models (LLMs) is high, but expensive and scarce GPUs are inefficient when generating tokens sequentially unless the batch of sequences is enlarged. However, the batch size is limited by some…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-03-19 Jiaao He , Jidong Zhai