Related papers: Direct Multi-Token Decoding

Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing

Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text to speech, speech to speech translation. Moreover, large language models…

Computation and Language · Computer Science · 2024-06-26 · Viet Anh Trinh, Rosy Southwell, Yiwen Guan, Xinlu He, Zhiyong Wang, Jacob Whitehill

Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration

Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, the substantial number of parameters in LLMs contributes to significant latency during model inference. This is particularly…

Computation and Language · Computer Science · 2024-04-19 · Pengfei Wu, Jiahao Liu, Zhuocheng Gong, Qifan Wang, Jinpeng Li, Jingang Wang, Xunliang Cai, Dongyan Zhao

Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation

The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems…

Computation and Language · Computer Science · 2025-06-03 · Yingfeng Luo, Tong Zheng, Yongyu Mu, Bei Li, Qinghong Zhang, Yongqi Gao, Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, Jingbo Zhu

DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers

In recent years, many interpretability methods have been proposed to help interpret the internal states of Transformer-models, at different levels of precision and complexity. Here, to analyze encoder-decoder Transformers, we propose a…

Computation and Language · Computer Science · 2024-04-04 · Anna Langedijk, Hosein Mohebbi, Gabriele Sarti, Willem Zuidema, Jaap Jumelet

Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders

Recent work in multilingual translation advances translation quality surpassing bilingual baselines using deep transformer models with increased capacity. However, the extra latency and memory costs introduced by this approach may make it…

Computation and Language · Computer Science · 2022-06-07 · Xiang Kong, Adithya Renduchintala, James Cross, Yuqing Tang, Jiatao Gu, Xian Li

DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding

Hyperscaling of data and parameter count in LLMs is yielding diminishing improvement when weighed against training costs, underlining a growing need for more efficient finetuning and inference without sacrificing performance. This is…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-27 · Moulik Choraria, Xinbo Wu, Akhil Bhimaraju, Nitesh Sekhar, Yue Wu, Xu Zhang, Prateek Singhal, Lav R. Varshney

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized…

Computation and Language · Computer Science · 2024-08-23 · Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy

Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does…

Computation and Language · Computer Science · 2026-02-10 · Abir Harrasse, Florent Draye, Punya Syon Pandey, Zhijing Jin, Bernhard Schölkopf

Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers

In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the…

Computation and Language · Computer Science · 2024-11-01 · Amit Ben-Artzy, Roy Schwartz

InnerThoughts: Disentangling Representations and Predictions in Large Language Models

Large language models (LLMs) contain substantial factual knowledge which is commonly elicited by multiple-choice question-answering prompts. Internally, such models process the prompt through multiple transformer layers, building varying…

Computation and Language · Computer Science · 2025-01-31 · Didier Chételat, Joseph Cotnareanu, Rylee Thompson, Yingxue Zhang, Mark Coates

Investigating Decoder-only Large Language Models for Speech-to-text Translation

Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating…

Computation and Language · Computer Science · 2024-07-04 · Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri

Speculative Decoding in Decentralized LLM Inference: Turning Communication Latency into Computation Throughput

Speculative decoding accelerates large language model (LLM) inference by using a lightweight draft model to propose tokens that are later verified by a stronger target model. While effective in centralized systems, its behavior in…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-11-18 · Jingwei Song, Wanyi Chen, Xinyuan Song, Max, Chris Tong, Gufeng Chen, Tianyi Zhao, Eric Yang, Bill Shi, Lynn Ai

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science · 2025-10-06 · Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

Enhancing Latent Computation in Transformers with Latent Tokens

Augmenting large language models (LLMs) with auxiliary tokens has emerged as a promising strategy for enhancing model performance. In this work, we introduce a lightweight method termed latent tokens; these are dummy tokens that may be…

Machine Learning · Computer Science · 2025-05-20 · Yuchang Sun, Yanxi Chen, Yaliang Li, Bolin Ding

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-07 · Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang

Deep Transformers with Latent Depth

The Transformer model has achieved state-of-the-art performance in many sequence modeling tasks. However, how to leverage model capacity with large or variable depths is still an open challenge. We present a probabilistic framework to…

Computation and Language · Computer Science · 2020-10-19 · Xian Li, Asa Cooper Stickland, Yuqing Tang, Xiang Kong

Layer by Layer: Uncovering Hidden Representations in Language Models

From extracting features to generating text, the outputs of large language models (LLMs) typically rely on the final layers, following the conventional wisdom that earlier layers capture only low-level cues. However, our analysis shows that…

Machine Learning · Computer Science · 2025-06-17 · Oscar Skean, Md Rifat Arefin, Dan Zhao, Niket Patel, Jalal Naghiyev, Yann LeCun, Ravid Shwartz-Ziv

The Bottom-up Evolution of Representations in the Transformer: A Study with Machine Translation and Language Modeling Objectives

We seek to understand how the representations of individual tokens and the structure of the learned feature space evolve between layers in deep neural networks under different learning objectives. We focus on the Transformers for our…

Computation and Language · Computer Science · 2019-09-05 · Elena Voita, Rico Sennrich, Ivan Titov

ES-dLLM: Efficient Inference for Diffusion Large Language Models by Early-Skipping

Diffusion large language models (dLLMs) are emerging as a promising alternative to autoregressive models (ARMs) due to their ability to capture bidirectional context and the potential for parallel generation. Despite the advantages, dLLM…

Machine Learning · Computer Science · 2026-03-12 · Zijian Zhu, Fei Ren, Zhanhong Tan, Kaisheng Ma

Learning to Decode Collaboratively with Multiple Language Models

We propose a method to teach multiple large language models (LLM) to collaborate by interleaving their generations at the token level. We model the decision of which LLM generates the next token as a latent variable. By optimizing the…

Computation and Language · Computer Science · 2024-08-28 · Shannon Zejiang Shen, Hunter Lang, Bailin Wang, Yoon Kim, David Sontag