Related papers: Progressive Mixed-Precision Decoding for Efficient…

Mixed-Precision Quantization for Language Models: Techniques and Prospects

The rapid scaling of language models (LMs) has resulted in unprecedented computational, memory, and energy requirements, making their training and deployment increasingly unsustainable. Quantization has emerged as an essential compression…

Machine Learning · Computer Science 2025-10-21 Mariam Rakka, Marios Fournarakis, Olga Krestinskaya, Jinane Bazzi, Khaled N. Salama, Fadi Kurdahi, Ahmed M. Eltawil, Mohammed E. Fouda

Predictive Pipelined Decoding: A Compute-Latency Trade-off for Exact LLM Decoding

This paper presents "Predictive Pipelined Decoding (PPD)," an approach that speeds up greedy decoding in Large Language Models (LLMs) while maintaining the exact same output as the original decoding. Unlike conventional strategies, PPD…

Computation and Language · Computer Science 2024-07-30 Seongjun Yang, Gibbeum Lee, Jaewoong Cho, Dimitris Papailiopoulos, Kangwook Lee

LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization

Recent breakthroughs in Large-scale language models (LLMs) have demonstrated impressive performance on various tasks. The immense sizes of LLMs have led to very high resource demand and cost for running the models. Though the models are…

Machine Learning · Computer Science 2024-03-05 Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Chuan Wu

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these…

Machine Learning · Computer Science 2025-10-01 Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan

P3-LLM: An Integrated NPU-PIM Accelerator for Edge LLM Inference Using Hybrid Numerical Formats

The substantial memory bandwidth and computational demands of large language models (LLMs) present critical challenges for efficient inference. To tackle this, the literature has explored heterogeneous systems that combine neural processing…

Hardware Architecture · Computer Science 2026-05-05 Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah

APreQEL: Adaptive Mixed Precision Quantization For Edge LLMs

Today, large language models have demonstrated their strengths in various tasks ranging from reasoning and code generation to complex problem solving. However, this advancement comes with a high computational cost and memory requirements,…

Machine Learning · Computer Science 2026-03-26 Meriem Bouzouad, Yuan-Hao Chang, Jalil Boukhobza

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

With the wide adoption of language models for IR -- and specifically RAG systems -- the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and therefore, compute…

Information Retrieval · Computer Science 2026-04-06 Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati

Lossless Acceleration of Large Language Model via Adaptive N-gram Parallel Decoding

While Large Language Models (LLMs) have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel…

Computation and Language · Computer Science 2024-07-11 Jie Ou, Yueming Chen, Wenhong Tian

LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

Mixed-precision inference techniques reduce the memory and computational demands of Large Language Models (LLMs) by applying hybrid precision formats to model weights, activations, and KV caches. However, existing systems struggle to (i)…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-05-18 Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Ningsheng Ma, Fangcheng Fu, Kai Chen

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Autoregressive decoding of large language models (LLMs) is memory bandwidth bounded, resulting in high latency and significant wastes of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding…

Machine Learning · Computer Science 2024-02-06 Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang

Training Large Reasoning Models Efficiently via Progressive Thought Encoding

Large reasoning models (LRMs) excel on complex problems but face a critical barrier to efficiency: reinforcement learning (RL) training requires long rollouts for outcome-based rewards, where autoregressive decoding dominates time and…

Machine Learning · Computer Science 2026-02-20 Zeliang Zhang, Xiaodong Liu, Hao Cheng, Hao Sun, Chenliang Xu, Jianfeng Gao

Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization

Large language models (LLMs) have demonstrated impressive abilities in various domains while the inference cost is expensive. Many previous studies exploit quantization methods to reduce LLM inference cost by reducing latency and memory…

Machine Learning · Computer Science 2024-11-12 Jinhao Li, Jiaming Xu, Shiyao Li, Shan Huang, Jun Liu, Yaoxiu Lian, Guohao Dai

Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition

State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Low-bit neural network…

Computation and Language · Computer Science 2021-12-22 Junhao Xu, Jianwei Yu, Shoukang Hu, Xunying Liu, Helen Meng

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor…

Machine Learning · Computer Science 2025-03-14 Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang

Towards Efficient Pre-training: Exploring FP4 Precision in Large Language Models

The burgeoning computational demands for training large language models (LLMs) necessitate efficient methods, including quantized training, which leverages low-bit arithmetic operations to reduce costs. While FP8 precision has shown…

Machine Learning · Computer Science 2025-02-18 Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Zhilin Pei, Zhongling Su, Liang Liu, Xingcheng Zhang, Weiming Zhang

Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

Large language models (LLMs) face challenges in fine-tuning and deployment due to their high memory demands and computational costs. While parameter-efficient fine-tuning (PEFT) methods aim to reduce the memory usage of the optimizer…

Machine Learning · Computer Science 2023-10-31 Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee

Beyond Real Weights: Hypercomplex Representations for Stable Quantization

Multimodal language models (MLLMs) require large parameter capacity to align high-dimensional visual features with linguistic representations, making them computationally heavy and difficult to deploy efficiently. We introduce a progressive…

Computer Vision and Pattern Recognition · Computer Science 2025-12-10 Jawad Ibn Ahad, Maisha Rahman, Amrijit Biswas, Muhammad Rafsan Kabir, Robin Krambroeckers, Sifat Momen, Nabeel Mohammed, Shafin Rahman

FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference

Quantization is a powerful tool to improve large language model (LLM) inference efficiency by utilizing more energy-efficient low-precision datapaths and reducing memory footprint. However, accurately quantizing LLM weights and activations…

Hardware Architecture · Computer Science 2025-04-22 Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Sophia Shao, Brucek Khailany

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science 2025-06-05 Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng