Related papers: TP-Aware Dequantization

Inference Performance Optimization for Large Language Models on CPUs

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry.…

Artificial Intelligence · Computer Science 2024-07-11 Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Large language models (LLMs) have demonstrated impressive abilities in various domains while the inference cost is expensive. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing latency and memory…

Machine Learning · Computer Science 2024-11-12 Jinhao Li, Jiaming Xu, Shiyao Li, Shan Huang, Jun Liu, Yaoxiu Lian, Guohao Dai

Fast NF4 Dequantization Kernels for Large Language Model Inference

Large language models (LLMs) have grown beyond the memory capacity of single GPU devices, necessitating quantization techniques for practical deployment. While NF4 (4-bit NormalFloat) quantization enables 4× memory reduction,…

Machine Learning · Computer Science 2026-04-06 Xiangbo Qi, Chaoyi Jiang, Murali Annavaram

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design

Six-bit quantization (FP6) can effectively reduce the size of large language models (LLMs) and preserve the model quality consistently across varied applications. However, existing systems do not provide Tensor Core support for FP6…

Machine Learning · Computer Science 2024-03-05 Haojun Xia, Zhen Zheng, Xiaoxia Wu, Shiyang Chen, Zhewei Yao, Stephen Youn, Arash Bakhtiari, Michael Wyatt, Donglin Zhuang, Zhongzhu Zhou, Olatunji Ruwase, Yuxiong He, Shuaiwen Leon Song

Large Language Model Partitioning for Low-Latency Inference at the Edge

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss-error-aware Grid

Large language models (LLMs) have shown immense potential across various domains, but their high memory requirements and inference costs remain critical challenges for deployment. Post-training quantization (PTQ) has emerged as a promising…

Machine Learning · Computer Science 2026-01-05 Tianyi Zhang, Anshumali Shrivastava

LLMEasyQuant: Scalable Quantization for Parallel and Distributed LLM Inference

As large language models (LLMs) grow in size and deployment scale, quantization has become an essential technique for reducing memory footprint and improving inference efficiency. However, existing quantization toolkits often lack…

Machine Learning · Computer Science 2025-12-01 Dong Liu, Yanxuan Yu

Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization

Large language models (LLMs) demonstrate outstanding performance in various tasks in machine learning and have thus become one of the most important workloads in today's computing landscape. However, deploying LLM inference poses challenges…

Machine Learning · Computer Science 2024-06-21 Jungi Lee, Wonbeom Lee, Jaewoong Sim

Towards Low-bit Communication for Tensor Parallel LLM Inference

Tensor parallelism provides an effective way to increase server large language model (LLM) inference efficiency despite adding an additional communication cost. However, as server LLMs continue to scale in size, they will need to be…

Artificial Intelligence · Computer Science 2024-11-13 Harry Dong, Tyler Johnson, Minsik Cho, Emad Soroush

Communication Compression for Tensor Parallel LLM Inference

Large Language Models (LLMs) have pushed the frontier of artificial intelligence but are comprised of hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators…

Machine Learning · Computer Science 2026-01-07 Jan Hansen-Palmus, Michael Truong Le, Oliver Hausdörfer, Alok Verma

RTP-LLM: High-Performance Alibaba LLM Inference Engine

Large Language Models (LLMs) have revolutionized AI applications, but deploying them at scale presents significant challenges. We present RTP-LLM, a high-performance inference engine for industrial-scale LLM deployment, successfully…

Operating Systems · Computer Science 2026-05-29 Boyu Tan, Jiarui Guo, Zongwei Lv, Hanbo Sun, Tong Yang, Kan Liu, Xinfei Shi, Zetao Hu, Yaxin Yu, Chi Zhang, Jianning Zhang, Xi Yang, Wei Zhang, Bo Cai, Silu Zhou, Xiyu Wang, Na He, Yinghao Yu, Wending Bao, Guiyang Huang, Yuxing Yuan, Juncheng Yin, Nan Wang, Lin Yang, Zechao Zhang, Lu Chen, Guoding Li, Tao Lan, Lin Qu

Efficient LLM Inference on CPUs

Large language models (LLMs) have demonstrated remarkable performance and tremendous potential across a wide range of tasks. However, deploying these models has been challenging due to the astronomical amount of model parameters, which…

Machine Learning · Computer Science 2023-12-08 Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng

Parallelization Strategies for Dense LLM Deployment: Navigating Through Application-Specific Tradeoffs and Bottlenecks

Breakthroughs in the generative AI domain have fueled an explosion of large language model (LLM)-powered applications, whose workloads fundamentally consist of sequences of inferences through transformer architectures. Within this rapidly…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-09 Burak Topcu, Musa Oguzhan Cim, Poovaiah Palangappa, Meena Arunachalam, Mahmut Taylan Kandemir

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization

Large language models (LLMs) have demonstrated state-of-the-art performance across various tasks. However, the latency of inference and the large GPU memory consumption of LLMs restrict their deployment performance. Recently, there have…

Machine Learning · Computer Science 2024-02-29 Yi Zhang, Fei Yang, Shuang Peng, Fangyu Wang, Aimin Pan

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Parallax: Efficient LLM Inference Service over Decentralized Environment

Deploying a large language model (LLM) inference service remains costly because centralized serving depends on specialized GPU clusters and high-bandwidth interconnects in datacenters. An appealing alternative is to leverage collaborative…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-01 Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, Binhang Yuan

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design

Large Language Models (LLMs) demonstrate exceptional performance but entail significant memory and computational costs, restricting their practical deployment. While existing INT4/INT8 quantization reduces these costs, they often degrade…

Machine Learning · Computer Science 2025-11-04 Hao Zhang, Aining Jia, Weifeng Bu, Yushu Cai, Kai Sheng, Hao Chen, Xin He

Distributed Inference and Fine-tuning of Large Language Models Over The Internet

Large language models (LLMs) are useful in many NLP tasks and become more capable with size, with the best open-source models having over 50 billion parameters. However, using these 50B+ models requires high-end hardware, making them…

Machine Learning · Computer Science 2023-12-14 Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin Raffel

Characterizing Communication Patterns in Distributed Large Language Model Inference

Large Language Models (LLMs) built on transformer architectures have transformed natural language processing, achieving remarkable performance across diverse applications. While distributed inference frameworks enable practical deployment…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-07-22 Lang Xu, Kaushik Kandadi Suresh, Quentin Anthony, Nawras Alnaasan, Dhabaleswar K. Panda