Efficient Beam Search for Large Language Models Using Trie-Based Decoding

Brian J Chan; MaoXun Huang; Jui-Hung Cheng; Chao-Ting Chen; Hen-Hsen Huang

Computation and Language 2025-09-23 v2

Abstract

This work presents a novel trie-based (prefix-tree) parallel decoding method that addresses the memory inefficiency of batch-based beam search. By sharing a single KV cache across beams with common prefixes, our approach dramatically reduces memory usage and enables efficient decoding. We evaluate our method across three attention architectures: Multi-Head Attention (Phi-3.5-mini-instruct), Grouped Query Attention (Llama-3.1-8B-Instruct), and Sliding Window Attention (Mistral-Small-24B-Instruct-2501), using CNN/DailyMail for abstractive summarization and HumanEval for code generation. Our experiments demonstrate substantial memory savings (4--8$\times$) and up to 2.4$\times$ faster decoding, without compromising generation quality. These results highlight our method's suitability for memory-constrained environments and large-scale deployments.
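
The prefix-sharing idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `TrieNode`, `build_trie`, and `count_nodes` and the token ids are hypothetical, and the sketch only counts how many per-token cache entries would be stored if each beam kept a private copy versus if beams with a common prefix shared one trie. The counts are for this made-up example only and say nothing about the savings measured in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TrieNode:
    """One generated token; its KV entry would be stored once and shared."""
    token: int
    children: dict = field(default_factory=dict)


def build_trie(beams):
    """Insert every beam (a list of token ids) into a prefix tree."""
    root = TrieNode(token=-1)  # sentinel root, holds no token
    for beam in beams:
        node = root
        for tok in beam:
            # Reuse the child if this prefix already exists, else create it.
            node = node.children.setdefault(tok, TrieNode(tok))
    return root


def count_nodes(node):
    """Count token nodes below `node`, i.e. entries a shared cache would keep."""
    return sum(1 + count_nodes(child) for child in node.children.values())


# Four hypothetical beams that share the prefix [5, 8] and then diverge.
beams = [
    [5, 8, 13, 21],
    [5, 8, 13, 34],
    [5, 8, 55, 89],
    [5, 8, 55, 144],
]
batch_entries = sum(len(b) for b in beams)     # one private cache per beam
trie_entries = count_nodes(build_trie(beams))  # shared prefixes stored once
print(batch_entries, trie_entries)
```

In this toy case the trie stores each shared prefix token once, so the entry count is lower than in the per-beam layout. A real decoder would additionally need an attention mask that restricts each beam to the nodes on its own root-to-leaf path.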

Keywords

key-value cache; attention mechanism; encoder-decoder architecture

Cite

@article{arxiv.2502.00085,
  title  = {Efficient Beam Search for Large Language Models Using Trie-Based Decoding},
  author = {Brian J Chan and MaoXun Huang and Jui-Hung Cheng and Chao-Ting Chen and Hen-Hsen Huang},
  journal= {arXiv preprint arXiv:2502.00085},
  year   = {2025}
}

Comments

13 pages, accepted as a main conference paper at EMNLP 2025
