Related papers: You Only Cache Once: Decoder-Decoder Architectures…

Universal YOCO for Efficient Depth Scaling

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies…

Computation and Language · Computer Science 2026-04-02 Yutao Sun, Li Dong, Tianzhu Ye, Shaohan Huang, Jianyong Wang, Furu Wei

YOCO++: Enhancing YOCO with KV Residual Connections for Efficient LLM Inference

Cross-layer key-value (KV) compression has been found to be effective in efficient inference of large language models (LLMs). Although they reduce the memory consumption of the KV cache, such methods usually introduce non-negligible…

Computation and Language · Computer Science 2026-04-16 You Wu, Ziheng Chen, Yizhen Zhang, Haoyi Wu, Chengting Yu, Yuchi Xu, Wenbo Su, Bo Zheng, Kewei Tu

Decoder-Only or Encoder-Decoder? Interpreting Language Model as a Regularized Encoder-Decoder

The sequence-to-sequence (seq2seq) task aims at generating the target sequence based on the given input source sequence. Traditionally, most of the seq2seq task is resolved by the Encoder-Decoder framework which requires an encoder to…

Computation and Language · Computer Science 2023-04-11 Zihao Fu, Wai Lam, Qian Yu, Anthony Man-Cho So, Shengding Hu, Zhiyuan Liu, Nigel Collier

Trained Persistent Memory for Frozen Decoder-Only LLMs

Decoder-only language models are stateless: hidden representations are discarded after every forward pass and nothing persists across sessions. Jeong (2026a) showed that trained memory adapters give a frozen encoder-decoder backbone…

Machine Learning · Computer Science 2026-03-25 Hong Jeong

Layer-Condensed KV Cache for Efficient Inference of Large Language Models

Huge memory consumption has been a major bottleneck for deploying high-throughput large language models in real-world applications. In addition to the large number of parameters, the key-value (KV) cache for the attention mechanism in the…

Computation and Language · Computer Science 2024-06-05 Haoyi Wu, Kewei Tu

ED2LM: Encoder-Decoder to Language Model for Faster Document Re-ranking Inference

State-of-the-art neural models typically encode document-query pairs using cross-attention for re-ranking. To this end, models generally utilize an encoder-only (like BERT) paradigm or an encoder-decoder (like T5) approach. These paradigms,…

Computation and Language · Computer Science 2022-04-26 Kai Hui, Honglei Zhuang, Tao Chen, Zhen Qin, Jing Lu, Dara Bahri, Ji Ma, Jai Prakash Gupta, Cicero Nogueira dos Santos, Yi Tay, Don Metzler

Block-Based Double Decoders

Encoder-decoder models offer substantial inference-time savings over decoder-only models, but their pretraining objectives suffer from sparse supervision and dynamic sequence lengths, keeping them out of practice at scale. We propose…

Machine Learning · Computer Science 2026-05-20 Asher Labovich, Benjamin Bradley, Vanessa Alexander, Chaitanya Harsha

OneCAT: Decoder-Only Auto-Regressive Model for Unified Understanding and Generation

We introduce OneCAT, a unified multimodal model that seamlessly integrates understanding, generation, and editing within a novel, pure decoder-only transformer architecture. Our framework uniquely eliminates the need for external components…

Computer Vision and Pattern Recognition · Computer Science 2025-10-08 Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, Hongkai Xiong

Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs

While large language models are primarily used on natural language tasks, they have also shown great promise when adapted to new modalities, e.g., for scientific machine learning tasks. Most proposed approaches for such cross-modal…

Machine Learning · Computer Science 2026-03-09 Paloma García-de-Herreros, Philipp Slusallek, Dietrich Klakow, Vagrant Gautam

Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition

Code-switching (CS) occurs when a speaker alternates words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech has to deal with two or more languages at the same time. In…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-19 Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, Haizhou Li

Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network…

Machine Learning · Computer Science 2025-10-28 Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, Volodymyr Kuleshov

Return of the Encoder: Maximizing Parameter Efficiency for SLMs

The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters…

Computation and Language · Computer Science 2025-01-31 Mohamed Elfeki, Rui Liu, Chad Voegele

Causal2Vec: Improving Decoder-only LLMs as Embedding Models through a Contextual Token

Decoder-only large language models (LLMs) have been increasingly adopted to build embedding models for diverse tasks. To overcome the inherent limitations of causal attention in representation learning, many existing methods modify the…

Computation and Language · Computer Science 2026-05-05 Ailiang Lin, Zhuoyun Li, Yusong Wang, Kotaro Funakoshi, Manabu Okumura

LaCo: Efficient Layer-wise Compression of Visual Tokens for Multimodal Large Language Models

Existing visual token compression methods for Multimodal Large Language Models (MLLMs) predominantly operate as post-encoder modules, limiting their potential for efficiency gains. To address this limitation, we propose LaCo (Layer-wise…

Computer Vision and Pattern Recognition · Computer Science 2025-07-04 Juntao Liu, Liqiang Niu, Wenchao Chen, Jie Zhou, Fandong Meng

VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

Recent research shows a big convergence in model architecture, training objectives, and inference methods across various tasks for different modalities. In this paper, we propose VioLA, a single auto-regressive Transformer decoder-only…

Computation and Language · Computer Science 2023-05-26 Tianrui Wang, Long Zhou, Ziqiang Zhang, Yu Wu, Shujie Liu, Yashesh Gaur, Zhuo Chen, Jinyu Li, Furu Wei

One Model To Learn Them All

Deep learning yields great results across many fields, from speech recognition, image classification, to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of…

Machine Learning · Computer Science 2017-06-19 Lukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit

Towards Better Code Understanding in Decoder-Only Models with Contrastive Learning

Recent advances in large-scale code generation models have led to remarkable progress in producing high-quality code. These models are trained in a self-supervised manner on extensive unlabeled code corpora using a decoder-only…

Software Engineering · Computer Science 2026-02-12 Jiayi Lin, Yanlin Wang, Yibiao Yang, Lei Zhang, Yutao Xie

Recurrent Memory-Augmented Transformers with Chunked Attention for Long-Context Language Modeling

We present a Transformer architecture for long-context language modeling that combines global attention with two biologically inspired components: chunked local attention and a gated FIFO memory mechanism. This unified attention block…

Machine Learning · Computer Science 2025-07-02 Ankit Kashyap

Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages

Encoder transformer models compress information from all tokens in a sequence into a single [CLS] token to represent global context. This approach risks diluting fine-grained or hierarchical features, leading to information loss in…

Computation and Language · Computer Science 2025-09-23 Asif Shahriar, Rifat Shahriyar, M Saifur Rahman

Decoder-only Architecture for Streaming End-to-end Speech Recognition

Decoder-only language models (LMs) have been successfully adopted for speech-processing tasks including automatic speech recognition (ASR). The LMs have ample expressiveness and perform efficiently. This efficiency is a suitable…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-02 Emiru Tsunoo, Hayato Futami, Yosuke Kashiwagi, Siddhant Arora, Shinji Watanabe