Related papers: Generation Meets Verification: Accelerating Large …

200 papers

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science · 2024-01-04 · Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao

Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the others. By analyzing dLLM denoising…

Computation and Language · Computer Science · 2026-05-27 · Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

The performance of large language models (LLMs) is closely linked to their underlying size, leading to ever-growing networks and hence slower inference. Speculative decoding has been proposed as a technique to accelerate autoregressive…

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science · 2024-04-19 · Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding…

Information Retrieval · Computer Science · 2025-02-27 · Xinyu Lin, Chaoqun Yang, Wenjie Wang, Yongqi Li, Cunxiao Du, Fuli Feng, See-Kiong Ng, Tat-Seng Chua

Large language models (LLMs) have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding…

Machine Learning · Computer Science · 2024-07-18 · Benjamin Bergner, Andrii Skliar, Amelie Royer, Tijmen Blankevoort, Yuki Asano, Babak Ehteshami Bejnordi

Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in…

Machine Learning · Computer Science · 2025-05-27 · Yixuan Wang, Yijun Liu, Shiyu ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

Large language models (LLMs) often face a bottleneck in inference speed due to their reliance on auto-regressive decoding. Recently, parallel decoding has shown significant promise in enhancing inference efficiency. However, we have…

Computation and Language · Computer Science · 2024-10-18 · Yuxuan Liu, Wenyuan Li, Laizhong Cui, Hailiang Yang

As Large Language Models (LLMs) have made significant advancements across various tasks, such as question answering, translation, text summarization, and dialogue systems, the need for accuracy in information becomes crucial, especially for…

Information Retrieval · Computer Science · 2024-05-31 · Yao Zhao, Zhitian Xie, Chen Liang, Chenyi Zhuang, Jinjie Gu

We present a novel inference scheme, self-speculative decoding, for accelerating Large Language Models (LLMs) without the need for an auxiliary model. This approach is characterized by a two-stage process: drafting and verification. The…

Computation and Language · Computer Science · 2025-02-11 · Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra

To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts…

Computation and Language · Computer Science · 2024-06-05 · Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during…

Computation and Language · Computer Science · 2024-11-21 · Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui

We propose LLMA, an LLM accelerator to losslessly speed up Large Language Model (LLM) inference with references. LLMA is motivated by the observation that there are abundant identical text spans between the decoding result by an LLM and the…

Computation and Language · Computer Science · 2023-04-11 · Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei

In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels…

Computation and Language · Computer Science · 2024-11-22 · Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang

Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-09-13 · Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of…

Computation and Language · Computer Science · 2026-02-05 · Ximing Dong, Shaowei Wang, Dayi Lin, Boyuan Chen, Ahmed E. Hassan

Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing…

Computation and Language · Computer Science · 2026-03-17 · Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

We introduce PARSE (PArallel pRefix Speculative Engine), a speculative generation framework that accelerates large language model (LLM) inference by parallelizing prefix verification on a semantic level. Existing speculative decoding…

Artificial Intelligence · Computer Science · 2026-05-07 · Yuncheng Yao, Yuxuan Xia, Shengjie Wang, Danyang Zhuo

As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time…

Computation and Language · Computer Science · 2026-02-11 · Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science · 2025-10-06 · Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang