Related papers: You Only Cache Once: Decoder-Decoder Architectures…

200 papers

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies…

Computation and Language · Computer Science 2026-04-02 Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei

Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible…

Computation and Language · Computer Science 2026-04-16 You Wu, Ziheng Chen, Yizhen Zhang, Haoyi Wu, Chengting Yu, Yuchi Xu, Wenbo Su, Bo Zheng, Kewei Tu

The sequence-to-sequence (seq2seq) task aims at generating the target sequence based on the given input source sequence. Traditionally, most of the seq2seq task is resolved by the Encoder-Decoder framework which requires an encoder to…

Computation and Language · Computer Science 2023-04-11 Zihao Fu, Wai Lam, Qian Yu, Anthony Man-Cho So, Shengding Hu, Zhiyuan Liu, Nigel Collier

Decoder-only language models are stateless: hidden representations are discarded after every forward pass and nothing persists across sessions. Jeong (2026a) showed that trained memory adapters give a frozen encoder-decoder backbone…

Machine Learning · Computer Science 2026-03-25 Hong Jeong

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the…

Computation and Language · Computer Science 2024-06-05 Haoyi Wu, Kewei Tu

State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms,…

Computation and Language · Computer Science 2022-04-26 Kai Hui, Honglei Zhuang, Tao Chen, Zhen Qin, Jing Lu, Dara Bahri, Ji Ma, Jai Prakash Gupta, Cicero Nogueira dos Santos, Yi Tay, Don Metzler

Encoder-decoder models offer substantial inference-time savings over decoder-only models, but their pretraining objectives suffer from sparse supervision and dynamic sequence lengths, keeping them out of practice at scale. We propose…

Machine Learning · Computer Science 2026-05-20 Asher Labovich, Benjamin Bradley, Vanessa Alexander, Chaitanya Harsha

We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components…

Computer Vision and Pattern Recognition · Computer Science 2025-10-08 Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong

While large language models are primarily used on natural language tasks, they have also shown great promise when adapted to new modalities, e.g., for scientific machine learning tasks. Most proposed approaches for such cross-modal…

Machine Learning · Computer Science 2026-03-09 Paloma García-de-Herreros, Philipp Slusallek, Dietrich Klakow, Vagrant Gautam

Code-switching (CS) occurs when a speaker alternates words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech has to deal with two or more languages at the same time. In…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-19 Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, Haizhou Li

Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network…

Machine Learning · Computer Science 2025-10-28 Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, Volodymyr Kuleshov

The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters…

Computation and Language · Computer Science 2025-01-31 Mohamed Elfeki, Rui Liu, Chad Voegele

Decoder-only large language models (LLMs) have been increasingly adopted to build embedding models for diverse tasks. To overcome the inherent limitations of causal attention in representation learning, many existing methods modify the…

Computation and Language · Computer Science 2026-05-05 Ailiang Lin, Zhuoyun Li, Yusong Wang, Kotaro Funakoshi, Manabu Okumura

Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng

Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only…

Computation and Language · Computer Science 2023-05-26 Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei

Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of…

Machine Learning · Computer Science 2017-06-19 Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit

Recent advances in large-scale code generation models have led to remarkable progress in producing high-quality code. These models are trained in a self-supervised manner on extensive unlabeled code corpora using a decoder-only…

Software Engineering · Computer Science 2026-02-12 Jiayi Lin, Yanlin Wang, Yibiao Yang, Lei Zhang, Yutao Xie

We present a Transformer architecture for long-context language modeling that combines global attention with two biologically inspired components: chunked local attention and a gated FIFO memory mechanism. This unified attention block…

Machine Learning · Computer Science 2025-07-02 Ankit Kashyap

Encoder transformer models compress information from all tokens in a sequence into a single [CLS] token to represent global context. This approach risks diluting fine-grained or hierarchical features, leading to information loss in…

Computation and Language · Computer Science 2025-09-23 Asif Shahriar, Rifat Shahriyar, M Saifur Rahman

Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-02 Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe