Related papers: Fast Decoding in Sequence Models using Discrete La…

Blockwise Parallel Decoding for Deep Autoregressive Models

Deep autoregressive sequence-to-sequence models have demonstrated impressive performance across a wide variety of tasks in recent years. While common architecture classes such as recurrent, convolutional, and self-attention networks make…

Machine Learning · Computer Science 2018-11-09 Mitchell Stern, Noam Shazeer, Jakob Uszkoreit

Non-autoregressive Sequence-to-Sequence Vision-Language Models

Sequence-to-sequence vision-language models are showing promise, but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence…

Computer Vision and Pattern Recognition · Computer Science 2025-03-14 Kunyu Shi, Qi Dong, Luis Goncalves, Zhuowen Tu, Stefano Soatto

Fast Structured Decoding for Sequence Models

Autoregressive sequence models achieve state-of-the-art performance in domains like machine translation. However, due to the autoregressive factorization nature, these models suffer from heavy latency during inference. Recently,…

Machine Learning · Computer Science 2020-01-10 Zhiqing Sun, Zhuohan Li, Haoqing Wang, Zi Lin, Di He, Zhi-Hong Deng

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Deep Learning Method for Cell-Wise Object Tracking, Velocity Estimation and Projection of Sensor Data over Time

Current Deep Learning methods for environment segmentation and velocity estimation rely on Convolutional Recurrent Neural Networks to exploit spatio-temporal relationships within obtained sensor data. These approaches derive scene dynamics…

Computer Vision and Pattern Recognition · Computer Science 2023-06-21 Marco Braun, Moritz Luszek, Mirko Meuter, Dominic Spata, Kevin Kollek, Anton Kummert

Auxiliary Guided Autoregressive Variational Autoencoders

Generative modeling of high-dimensional data is a key problem in machine learning. Successful approaches include latent variable models and autoregressive models. The complementary strengths of these approaches, to model global and local…

Computer Vision and Pattern Recognition · Computer Science 2019-04-19 Thomas Lucas, Jakob Verbeek

Accelerating Transformer Decoding via a Hybrid of Self-attention and Recurrent Neural Network

Due to the highly parallelizable architecture, Transformer is faster to train than RNN-based models and popularly used in machine translation tasks. However, at inference time, each output word requires all the hidden states of the…

Computation and Language · Computer Science 2019-09-06 Chengyi Wang, Shuangzhi Wu, Shujie Liu

Deconvolutional Latent-Variable Model for Text Sequence Matching

A latent-variable model is introduced for text matching, inferring sentence representations by jointly optimizing generative and discriminative objectives. To alleviate typical optimization challenges in latent-variable models for text, we…

Computation and Language · Computer Science 2017-11-23 Dinghan Shen, Yizhe Zhang, Ricardo Henao, Qinliang Su, Lawrence Carin

PaDeLLM-NER: Parallel Decoding in Large Language Models for Named Entity Recognition

In this study, we aim to reduce generation latency for Named Entity Recognition (NER) with Large Language Models (LLMs). The main cause of high latency in LLMs is the sequential decoding process, which autoregressively generates all labels…

Computation and Language · Computer Science 2024-11-22 Jinghui Lu, Ziwei Yang, Yanjie Wang, Xuejing Liu, Brian Mac Namee, Can Huang

Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior

Although neural machine translation models reached high translation quality, the autoregressive nature makes inference difficult to parallelize and leads to high translation latency. Inspired by recent refinement-based approaches, we…

Computation and Language · Computer Science 2019-11-22 Raphael Shu, Jason Lee, Hideki Nakayama, Kyunghyun Cho

End-to-End Non-Autoregressive Neural Machine Translation with Connectionist Temporal Classification

Autoregressive decoding is the only part of sequence-to-sequence models that prevents them from massive parallelization at inference time. Non-autoregressive models enable the decoder to generate all output symbols independently in…

Computation and Language · Computer Science 2018-11-13 Jindřich Libovický, Jindřich Helcl

Accelerating Transformer Inference for Translation via Parallel Decoding

Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT). The community proposed specific network architectures and learning-based methods to solve this issue, which are expensive and require changes to the…

Computation and Language · Computer Science 2025-02-06 Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà

Fast and Fluent Diffusion Language Models via Convolutional Decoding and Rejective Fine-tuning

Autoregressive (AR) language models generate text one token at a time, which limits their inference speed. Diffusion-based language models offer a promising alternative, as they can decode multiple tokens in parallel. However, we identify a…

Computation and Language · Computer Science 2025-10-27 Yeongbin Seo, Dongha Lee, Jaehyung Kim, Jinyoung Yeo

Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation

Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where…

Computation and Language · Computer Science 2021-06-28 Jungo Kasai, Nikolaos Pappas, Hao Peng, James Cross, Noah A. Smith

On Recurrent Neural Networks for Sequence-based Processing in Communications

In this work, we analyze the capabilities and practical limitations of neural networks (NNs) for sequence-based signal processing which can be seen as an omnipresent property in almost any modern communication systems. In particular, we…

Information Theory · Computer Science 2019-11-22 Daniel Tandler, Sebastian Dörner, Sebastian Cammerer, Stephan ten Brink

Fast Inference from Transformers via Speculative Decoding

Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any…

Machine Learning · Computer Science 2023-05-22 Yaniv Leviathan, Matan Kalman, Yossi Matias

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Continuous Speculative Decoding for Autoregressive Image Generation

Continuous visual autoregressive (AR) models have demonstrated promising performance in image generation. However, the heavy autoregressive inference burden imposes significant overhead. In Large Language Models (LLMs), speculative decoding…

Computer Vision and Pattern Recognition · Computer Science 2025-09-30 Zili Wang, Robert Zhang, Kun Ding, Qi Yang, Fei Li, Shiming Xiang

Accelerating Inference of Discrete Autoregressive Normalizing Flows by Selective Jacobi Decoding

Discrete normalizing flows are promising generative models with advantages such as analytical log-likelihood computation and end-to-end training. However, the architectural constraints to ensure invertibility and tractable Jacobian…

Machine Learning · Computer Science 2026-05-06 Jiaru Zhang, Juanwu Lu, Xiaoyu Wu, Ziran Wang, Ruqi Zhang

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang