Related papers: Parallel Decoder Transformer: Planner-Seeded Laten…

Parallel Token Prediction for Language Models

Autoregressive decoding in language models is inherently slow, generating only one token per forward pass. We propose Parallel Token Prediction (PTP), a general-purpose framework for predicting multiple tokens in a single model call. PTP…

Computation and Language · Computer Science 2026-03-06 Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt

Decoder Tuning: Efficient Language Understanding as Decoding

With the evergrowing sizes of pre-trained models (PTMs), it has been an emerging practice to only provide the inference APIs for users, namely model-as-a-service (MaaS) setting. To adapt PTMs with model parameters frozen, most current…

Computation and Language · Computer Science 2023-05-25 Ganqu Cui, Wentao Li, Ning Ding, Longtao Huang, Zhiyuan Liu, Maosong Sun

DDT: Decoupled Diffusion Transformer

Diffusion transformers have demonstrated remarkable generation quality, albeit requiring longer training iterations and numerous inference steps. In each denoising step, diffusion transformers encode the noisy inputs to extract the…

Computer Vision and Pattern Recognition · Computer Science 2025-04-10 Shuai Wang, Zhi Tian, Weilin Huang, Limin Wang

Parallel Track Transformers: Enabling Fast GPU Inference with Reduced Synchronization

Efficient large-scale inference of transformer-based large language models (LLMs) remains a fundamental systems challenge, frequently requiring multi-GPU parallelism to meet stringent latency and throughput targets. Conventional tensor…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-10 Chong Wang, Nan Du, Tom Gunter, Tao Lei, Kulin Seth, Senyu Tong, Jianyu Wang, Guoli Yin, Xiyou Zhou, Kelvin Zou, Ruoming Pang

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science 2025-10-06 Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science 2024-04-19 Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Prompt Tuning Decision Transformers with Structured and Scalable Bandits

Prompt tuning has emerged as a key technique for adapting large pre-trained Decision Transformers (DTs) in offline Reinforcement Learning (RL), particularly in multi-task and few-shot settings. The Prompting Decision Transformer (PDT)…

Machine Learning · Computer Science 2025-10-02 Finn Rietz, Oleg Smirnov, Sara Karimi, Lele Cao

Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks

Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger more generalized decoder-only models, such as…

Computation and Language · Computer Science 2024-11-19 Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf

Transducers with Pronunciation-aware Embeddings for Automatic Speech Recognition

This paper proposes Transducers with Pronunciation-aware Embeddings (PET). Unlike conventional Transducers where the decoder embeddings for different tokens are trained independently, the PET model's decoder embedding incorporates shared…

Computation and Language · Computer Science 2024-04-09 Hainan Xu, Zhehuai Chen, Fei Jia, Boris Ginsburg

Decoder-only Streaming Transformer for Simultaneous Translation

Simultaneous Machine Translation (SiMT) generates translation while reading source tokens, essentially producing the target prefix based on the source prefix. To achieve good performance, it leverages the relationship between source and…

Computation and Language · Computer Science 2024-06-07 Shoutao Guo, Shaolei Zhang, Yang Feng

Prompt Guided Transformer for Multi-Task Dense Prediction

Task-conditional architecture offers advantage in parameter efficiency but falls short in performance compared to state-of-the-art multi-decoder methods. How to trade off performance and model parameters is an important and difficult…

Computer Vision and Pattern Recognition · Computer Science 2023-07-31 Yuxiang Lu, Shalayiding Sirejiding, Yue Ding, Chunlin Wang, Hongtao Lu

SDPT: Synchronous Dual Prompt Tuning for Fusion-based Visual-Language Pre-trained Models

Prompt tuning methods have achieved remarkable success in parameter-efficient fine-tuning on large pre-trained models. However, their application to dual-modal fusion-based visual-language pre-trained models (VLPMs), such as GLIP, has…

Computer Vision and Pattern Recognition · Computer Science 2024-07-17 Yang Zhou, Yongjian Wu, Jiya Saiyin, Bingzheng Wei, Maode Lai, Eric Chang, Yan Xu

Latent-attention Based Transformer for Near ML Polar Decoding in Short-code Regime

Transformer architectures have emerged as promising deep learning (DL) tools for modeling complex sequence-to-sequence interactions in channel decoding. However, current transformer-based decoders for error correction codes (ECCs)…

Signal Processing · Electrical Eng. & Systems 2025-07-22 Hongzhi Zhu, Wei Xu, Xiaohu You

DeMPT: Decoding-enhanced Multi-phase Prompt Tuning for Making LLMs Be Better Context-aware Translators

Generally, the decoder-only large language models (LLMs) are adapted to context-aware neural machine translation (NMT) in a concatenating way, where LLMs take the concatenation of the source sentence (i.e., intra-sentence context) and the…

Computation and Language · Computer Science 2024-09-24 Xinglin Lyu, Junhui Li, Yanqing Zhao, Min Zhang, Daimeng Wei, Shimin Tao, Hao Yang, Min Zhang

Parallel Loop Transformer for Efficient Test-Time Computation Scaling

Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or "loops." However, this…

Computation and Language · Computer Science 2025-10-30 Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, Xingyan Bin

Accelerating Transformer Inference for Translation via Parallel Decoding

Autoregressive decoding limits the efficiency of transformers for Machine Translation (MT). The community proposed specific network architectures and learning-based methods to solve this issue, which are expensive and require changes to the…

Computation and Language · Computer Science 2025-02-06 Andrea Santilli, Silvio Severino, Emilian Postolache, Valentino Maiorca, Michele Mancusi, Riccardo Marin, Emanuele Rodolà

ParallelSpec: Parallel Drafter for Efficient Speculative Decoding

Speculative decoding has proven to be an efficient solution to large language model (LLM) inference, where the small drafter predicts future tokens at a low cost, and the target model is leveraged to verify them in parallel. However, most…

Computation and Language · Computer Science 2024-10-10 Zilin Xiao, Hongming Zhang, Tao Ge, Siru Ouyang, Vicente Ordonez, Dong Yu

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation

We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et…

Computation and Language · Computer Science 2020-11-21 Hang Le, Juan Pino, Changhan Wang, Jiatao Gu, Didier Schwab, Laurent Besacier

Efficient Document Parsing via Parallel Token Prediction

Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing…

Computation and Language · Computer Science 2026-03-17 Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source…

Machine Learning · Computer Science 2026-05-27 Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen