Related papers: Efficient Beam Search for Large Language Models Us…

200 papers

Large language models (LLMs) are increasingly employed for complex tasks that process multiple generation calls in a tree structure with shared prefixes of tokens, including few-shot prompting, multi-step reasoning, speculative decoding,…

Computation and Language · Computer Science 2025-03-10 Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin

Our formulation reveals that the reduction across the sequence axis can be efficiently computed in parallel through a tree reduction. Our algorithm, called Tree Attention, for parallelizing exact attention computation across multiple GPUs…

Machine Learning · Computer Science 2025-02-11 Vasudev Shyam, Jonathan Pilault, Emily Shepperd, Quentin Anthony, Beren Millidge

This study introduces bifurcated attention, a method designed to enhance language model inference in shared-context batch decoding scenarios. Our approach addresses the challenge of redundant memory IO costs, a critical factor contributing…

We propose an efficient batching strategy for variable-length decoding on GPU architectures. During decoding, when candidates terminate or are pruned according to heuristics, our streaming approach periodically "refills" the batch before…

Computation and Language · Computer Science 2021-08-17 Kevin Yang, Violet Yao, John DeNero, Dan Klein

End-to-end automatic speech recognition (E2E-ASR) can be classified by its decoder architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-15 Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

Beam search is a widely used approximate search strategy for neural network decoders, and it generally outperforms simple greedy decoding on tasks like machine translation. However, this improvement comes at substantial computational cost.…

Computation and Language · Computer Science 2018-08-29 Yun Chen, Victor O. K. Li, Kyunghyun Cho, Samuel R. Bowman

Beam search is a desirable choice of test-time decoding algorithm for neural sequence models because it potentially avoids search errors made by simpler greedy methods. However, typical cross entropy training procedures for these models do…

Machine Learning · Computer Science 2017-10-10 Kartik Goyal, Graham Neubig, Chris Dyer, Taylor Berg-Kirkpatrick

Prefix-sharing among multiple prompts presents opportunities to combine the operations of the shared prefix, while attention computation in the decode stage, which becomes a critical bottleneck with increasing context lengths, is a…

Recent advances in conditional recurrent language modelling have mainly focused on network architectures (e.g., attention mechanism), learning algorithms (e.g., scheduled sampling and sequence-level training) and novel applications (e.g.,…

Computation and Language · Computer Science 2016-05-13 Kyunghyun Cho

The standard content-based attention mechanism typically used in sequence-to-sequence models is computationally expensive as it requires the comparison of large encoder and decoder states at each time step. In this work, we propose an…

Computation and Language · Computer Science 2017-07-04 Denny Britz, Melody Y. Guan, Minh-Thang Luong

As Large Language Models (LLMs) can now process extremely long contexts, efficient inference over these extended inputs has become increasingly important, especially for emerging applications like LLM agents that highly depend on this…

Computation and Language · Computer Science 2026-04-09 Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An

As one popular modeling approach for end-to-end speech recognition, attention-based encoder-decoder models are known to suffer the length bias and corresponding beam problem. Different approaches have been applied in simple beam search to…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-24 Wei Zhou, Ralf Schlüter, Hermann Ney

LLM serving is increasingly dominated by decode attention, which is a memory-bound operation due to massive KV cache loading from global memory. Meanwhile, real-world workloads exhibit substantial, hierarchical shared prefixes across…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-03-17 Jinjun Yi, Zhixin Zhao, Yitao Hu, Ke Yan, Weiwei Sun, Hao Wang, Laiping Zhao, Yuhao Zhang, Wenxin Li, Keqiu Li

Standard Recurrent Neural Network Transducers (RNN-T) decoding algorithms for speech recognition are iterating over the time axis, such that one time step is decoded before moving on to the next time step. Those algorithms result in a large…

Machine Learning · Computer Science 2023-10-09 Gil Keren

One of the key challenges in machine learning is to design a computationally efficient multi-class classifier while maintaining the output accuracy and performance. In this paper, we present a tree-based classifier: Attention Tree (ATree)…

Computer Vision and Pattern Recognition · Computer Science 2016-08-03 Priyadarshini Panda, Kaushik Roy

Generative retrieval (GR) ranks documents by autoregressively generating document identifiers. Because many GR methods rely on trie-constrained beam search, they are vulnerable to early pruning of relevant prefixes under finite-beam…

Information Retrieval · Computer Science 2026-05-26 Kidist Amde Mekonnen, Yongkang Li, Yubao Tang, Simon Lupart, Maarten de Rijke

Although frame-based models, such as CTC and transducers, have an affinity for streaming automatic speech recognition, their decoding uses no future knowledge, which could lead to incorrect pruning. Conversely, label-based attention…

Audio and Speech Processing · Electrical Eng. & Systems 2023-07-25 Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe

Multiple heads decoding accelerates the inference of Large Language Models (LLMs) by predicting the next several tokens simultaneously. It generates and verifies multiple candidate sequences in parallel via tree attention with a fixed…

Computer Vision and Pattern Recognition · Computer Science 2025-02-11 Zhendong Zhang

We introduce a new beam search decoder that is fully differentiable, making it possible to optimize at training time through the inference procedure. Our decoder allows us to combine models which operate at different granularities (e.g.…

Computation and Language · Computer Science 2019-02-19 Ronan Collobert, Awni Hannun, Gabriel Synnaeve

This study mainly investigates two common decoding problems in neural keyphrase generation: sequence length bias and beam diversity. To tackle the problems, we introduce a beam search decoding strategy based on word-level and ngram-level…

Computation and Language · Computer Science 2023-10-31 Iftitahu Ni'mah, Vlado Menkovski, Mykola Pechenizkiy