Related papers: Learning to Keep a Promise: Scaling Language Model…

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Plato: Plan to Efficiently Decode for Large Language Model Inference

Large language models (LLMs) have achieved remarkable success in natural language tasks, but their inference incurs substantial computational and memory overhead. To improve efficiency, parallel decoding methods like Skeleton-of-Thought…

Computation and Language · Computer Science 2025-04-15 Shuowei Jin, Xueshen Liu, Yongji Wu, Haizhong Zheng, Qingzhao Zhang, Atul Prakash, Matthew Lentz, Danyang Zhuo, Feng Qian, Z. Morley Mao

Tell Your Model Where to Attend: Post-hoc Attention Steering for LLMs

In human-written articles, we often leverage the subtleties of text style, such as bold and italics, to guide the attention of readers. These textual emphases are vital for the readers to grasp the conveyed information. When interacting…

Computation and Language · Computer Science 2024-10-02 Qingru Zhang, Chandan Singh, Liyuan Liu, Xiaodong Liu, Bin Yu, Jianfeng Gao, Tuo Zhao

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

AdaDecode: Accelerating LLM Decoding with Adaptive Layer Parallelism

Large language models (LLMs) are increasingly used for long-content generation (e.g., long Chain-of-Thought reasoning) where decoding efficiency becomes a critical bottleneck: Autoregressive decoding is inherently limited by its sequential…

Computation and Language · Computer Science 2025-06-05 Zhepei Wei, Wei-Lin Chen, Xinyu Zhu, Yu Meng

Temporal Self-Rewarding Language Models: Decoupling Chosen-Rejected via Past-Future

Self-Rewarding Language Models propose an architecture in which a Large Language Model (LLM) both generates responses and evaluates its own outputs via LLM-as-a-Judge prompting, dynamically improving its generative capabilities through…

Computation and Language · Computer Science 2025-08-11 Yidong Wang, Xin Wang, Cunxiang Wang, Junfeng Fang, Qiufeng Wang, Jianing Chu, Xuran Meng, Shuxun Yang, Libo Qin, Yue Zhang, Wei Ye, Shikun Zhang

Learning Adaptive LLM Decoding

Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We…

Machine Learning · Computer Science 2026-03-17 Chloe H. Su, Zhe Ye, Samuel Tenka, Aidan Yang, Soonho Kong, Udaya Ghai

Learning to Decode Collaboratively with Multiple Language Models

We propose a method to teach multiple large language models (LLMs) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the…

Computation and Language · Computer Science 2024-08-28 Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, David Sontag

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes…

Computation and Language · Computer Science 2026-05-18 Shengyin Sun, Yiming Li, Renxi Liu, Xinqi Li, Hui-Ling Zhen, Weizhe Lin, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

Inference with Reference: Lossless Acceleration of Large Language Models

We propose LLMA, an LLM accelerator to losslessly speed up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the…

Computation and Language · Computer Science 2023-04-11 Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei

PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning

Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. The LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream…

Computation and Language · Computer Science 2024-06-27 Shiva Kumar Pentyala, Zhichao Wang, Bin Bi, Kiran Ramnath, Xiang-Bo Mao, Regunathan Radhakrishnan, Sitaram Asur, Na Cheng

Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding

This research aims to accelerate the inference speed of large language models (LLMs) with billions of parameters. We propose Smart Parallel Auto-Correct dEcoding (SPACE), an innovative approach…

Computation and Language · Computer Science 2024-05-21 Hanling Yi, Feng Lin, Hongbin Li, Peiyang Ning, Xiaotian Yu, Rong Xiao

Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering

Large language models (LLMs) have demonstrated remarkable performance across various real-world tasks. However, they often struggle to fully comprehend and effectively utilize their input contexts, resulting in responses that are unfaithful…

Computation and Language · Computer Science 2024-09-18 Qingru Zhang, Xiaodong Yu, Chandan Singh, Xiaodong Liu, Liyuan Liu, Jianfeng Gao, Tuo Zhao, Dan Roth, Hao Cheng

PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

Watermarking for large language models (LLMs) is a promising approach for detecting LLM-generated text and enabling responsible deployment. However, existing watermarking methods are often vulnerable to semantic-invariant attacks, such as…

Cryptography and Security · Computer Science 2026-05-26 Zhenxin Ai, Haiyun He

LESA: Learnable LLM Layer Scaling-Up

Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger…

Machine Learning · Computer Science 2025-02-20 Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, Hai Zhao

dParallel: Learnable Parallel Decoding for dLLMs

Diffusion large language models (dLLMs) have recently drawn considerable attention within the research community as a promising alternative to autoregressive generation, offering parallel token prediction and lower inference latency. Yet,…

Computation and Language · Computer Science 2025-10-01 Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang

Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Autoregressive decoding of large language models (LLMs) is memory-bandwidth bound, resulting in high latency and significant waste of the parallel processing power of modern accelerators. Existing methods for accelerating LLM decoding…

Machine Learning · Computer Science 2024-02-06 Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang

APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding

The massive adoption of large language models (LLMs) demands efficient deployment strategies. However, the auto-regressive decoding process, which is fundamental to how most LLMs generate text, poses challenges to achieve efficient serving.…

Computation and Language · Computer Science 2024-01-15 Mingdao Liu, Aohan Zeng, Bowen Wang, Peng Zhang, Jie Tang, Yuxiao Dong

Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference

The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has investigated various speculative decoding techniques for multi-token generation, these…

Machine Learning · Computer Science 2025-10-01 Hao Mark Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan

Faster Speech-LLaMA Inference with Multi-token Prediction

Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-13 Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli