Related papers: Efficient Beam Search for Large Language Models Us…

DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference

Large language models (LLMs) are increasingly employed for complex tasks that process multiple generation calls in a tree structure with shared prefixes of tokens, including few-shot prompting, multi-step reasoning, speculative decoding,…

Computation and Language · Computer Science 2025-03-10 Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin

Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters

Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, called Tree Attention, for parallelizing exact attention computation across multiple GPUs…

Machine Learning · Computer Science 2025-02-11 Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge

Bifurcated Attention: Accelerating Massively Parallel Decoding with Shared Prefixes in LLMs

This study introduces bifurcated attention, a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing…

Machine Learning · Computer Science 2024-07-15 Ben Athiwaratkun, Sujan Kumar Gonugondla, Sanjay Krishna Gouda, Haifeng Qian, Hantian Ding, Qing Sun, Jun Wang, Jiacheng Guo, Liangfu Chen, Parminder Bhatia, Ramesh Nallapati, Sudipta Sengupta, Bing Xiang

A Streaming Approach For Efficient Batched Beam Search

We propose an efficient batching strategy for variable-length decoding on GPU architectures. During decoding, when candidates terminate or are pruned according to heuristics, our streaming approach periodically "refills" the batch before…

Computation and Language · Computer Science 2021-08-17 Kevin Yang, Violet Yao, John DeNero, Dan Klein

Joint Beam Search Integrating CTC, Attention, and Transducer Decoders

End-to-end automatic speech recognition (E2E-ASR) can be classified by its decoder architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-15 Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

A Stable and Effective Learning Strategy for Trainable Greedy Decoding

Beam search is a widely used approximate search strategy for neural network decoders, and it generally outperforms simple greedy decoding on tasks like machine translation. However, this improvement comes at substantial computational cost.…

Computation and Language · Computer Science 2018-08-29 Yun Chen, Victor O. K. Li, Kyunghyun Cho, Samuel R. Bowman

A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models

Beam search is a desirable choice of test-time decoding algorithm for neural sequence models because it potentially avoids search errors made by simpler greedy methods. However, typical cross entropy training procedures for these models do…

Machine Learning · Computer Science 2017-10-10 Kartik Goyal, Graham Neubig, Chris Dyer, Taylor Berg-Kirkpatrick

CoDec: Prefix-Shared Decoding Kernel for LLMs

Prefix-sharing among multiple prompts presents opportunities to combine the operations of the shared prefix, while attention computation in the decode stage, which becomes a critical bottleneck with increasing context lengths, is a…

Machine Learning · Computer Science 2026-03-31 Zhibin Wang, Rui Ning, Chao Fang, Zhonghui Zhang, Xi Lin, Shaobo Ma, Mo Zhou, Xue Li, Zhongfeng Wang, Chengying Huan, Rong Gu, Kun Yang, Guihai Chen, Sheng Zhong, Chen Tian

Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model

Recent advances in conditional recurrent language modelling have mainly focused on network architectures (e.g., attention mechanism), learning algorithms (e.g., scheduled sampling and sequence-level training) and novel applications (e.g.,…

Computation and Language · Computer Science 2016-05-13 Kyunghyun Cho

Efficient Attention using a Fixed-Size Memory Representation

The standard content-based attention mechanism typically used in sequence-to-sequence models is computationally expensive as it requires the comparison of large encoder and decoder states at each time step. In this work, we propose an…

Computation and Language · Computer Science 2017-07-04 Denny Britz, Melody Y. Guan, Minh-Thang Luong

LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this…

Computation and Language · Computer Science 2026-04-09 Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias

As one popular modeling approach for end-to-end speech recognition, attention-based encoder-decoder models are known to suffer the length bias and corresponding beam problem. Different approaches have been applied in simple beam search to…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-24 Wei Zhou, Ralf Schlüter, Hermann Ney

PAT: Accelerating LLM Decoding via Prefix-Aware Attention with Resource Efficient Multi-Tile Kernel

LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-17 Jinjun Yi, Zhixin Zhao, Yitao Hu, Ke Yan, Weiwei Sun, Hao Wang, Laiping Zhao, Yuhao Zhang, Wenxin Li, Keqiu Li

A Token-Wise Beam Search Algorithm for RNN-T

Standard Recurrent Neural Network Transducers (RNN-T) decoding algorithms for speech recognition are iterating over the time axis, such that one time step is decoded before moving on to the next time step. Those algorithms result in a large…

Machine Learning · Computer Science 2023-10-09 Gil Keren

Attention Tree: Learning Hierarchies of Visual Features for Large-Scale Image Recognition

One of the key challenges in machine learning is to design a computationally efficient multi-class classifier while maintaining the output accuracy and performance. In this paper, we present a tree-based classifier: Attention Tree (ATree)…

Computer Vision and Pattern Recognition · Computer Science 2016-08-03 Priyadarshini Panda, Kaushik Roy

Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval

Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Because many GR methods rely on trie-constrained beam search, they are vulnerable to early pruning of relevant prefixes under finite-beam…

Information Retrieval · Computer Science 2026-05-26 Kidist Amde Mekonnen, Yongkang Li, Yubao Tang, Simon Lupart, Maarten de Rijke

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning. Conversely, label-based attention…

Audio and Speech Processing · Electrical Eng. & Systems 2023-07-25 Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention

Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Zhendong Zhang

A Fully Differentiable Beam Search Decoder

We introduce a new beam search decoder that is fully differentiable, making it possible to optimize at training time through the inference procedure. Our decoder allows us to combine models which operate at different granularities (e.g.…

Computation and Language · Computer Science 2019-02-19 Ronan Collobert, Awni Hannun, Gabriel Synnaeve

BSDAR: Beam Search Decoding with Attention Reward in Neural Keyphrase Generation

This study mainly investigates two common decoding problems in neural keyphrase generation: sequence length bias and beam diversity. To tackle the problems, we introduce a beam search decoding strategy based on word-level and ngram-level…

Computation and Language · Computer Science 2023-10-31 Iftitahu Ni'mah, Vlado Menkovski, Mykola Pechenizkiy