Related papers: DEED: Dynamic Early Exit on Decoder for Accelerati…

You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model

Large-scale Transformer models bring significant improvements for various downstream vision language tasks with a unified architecture. The performance improvements come with increasing model size, resulting in slow inference speed and…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu

DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free,…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue

DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding

Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed…

Computation and Language · Computer Science 2025-08-08 Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

Dynamic Encoder Transducer: A Flexible Solution For Trading Off Accuracy For Latency

We propose a dynamic encoder transducer (DET) for on-device speech recognition. One DET model scales to multiple devices with different computation capacities without retraining or finetuning. To trade off accuracy and latency, DET…

Computation and Language · Computer Science 2021-04-07 Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer

ADEPT: Adaptive Dynamic Early-Exit Process for Transformers

The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting…

Computation and Language · Computer Science 2026-01-08 Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi

Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead

Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early…

Machine Learning · Computer Science 2022-06-30 Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications. However, they are also notorious for being slow in inference, which makes them difficult to deploy in real-time applications. We…

Computation and Language · Computer Science 2020-04-28 Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin

Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models

Recently, Transformer-based encoder-decoder models have demonstrated strong performance in multilingual speech recognition. However, the decoder's autoregressive nature and large size introduce significant bottlenecks during inference.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-28 Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim

CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding

In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is…

Robotics · Computer Science 2025-06-17 Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, Haoang Li

Efficient Encoder-Decoder Transformer Decoding for Decomposable Tasks

Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger more generalized decoder-only models, such as…

Computation and Language · Computer Science 2024-11-19 Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf

BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts

Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy in these techniques crucially depend on the criteria used to make exit decisions. We propose…

Machine Learning · Computer Science 2025-02-04 Divya Jyoti Bajpai, Manjesh Kumar Hanawal

Computational Efficient Width-Wise Early Exiting in Wireless Communication Systems

Deep learning (DL) techniques are increasingly pervasive across various domains, including wireless communication, where they extract insights from raw radio signals. However, the computational demands of DL pose significant challenges,…

Signal Processing · Electrical Eng. & Systems 2024-09-05 Dieter Verbruggen, Hazem Sallouha, Sofie Pollin

The Diminishing Returns of Early-Exit Decoding in Modern LLMs

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining…

Computation and Language · Computer Science 2026-03-26 Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang

Efficient DETR: Improving End-to-End Object Detector with Dense Prior

The recently proposed end-to-end transformer detectors, such as DETR and Deformable DETR, have a cascade structure of stacking 6 decoder layers to update object queries iteratively, without which their performance degrades seriously. In…

Computer Vision and Pattern Recognition · Computer Science 2021-04-06 Zhuyu Yao, Jiangbo Ai, Boxun Li, Chi Zhang

Accelerating Large Language Model Inference with Self-Supervised Early Exits

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's…

Computation and Language · Computer Science 2026-02-13 Florian Valade

Multilingual Neural Machine Translation with Deep Encoder and Multiple Shallow Decoders

Recent work in multilingual translation advances translation quality surpassing bilingual baselines using deep transformer models with increased capacity. However, the extra latency and memory costs introduced by this approach may make it…

Computation and Language · Computer Science 2022-06-07 Xiang Kong, Adithya Renduchintala, James Cross, Yuqing Tang, Jiatao Gu, Xian Li

Dynamic Rebatching for Efficient Early-Exit Inference with DREX

Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-12-18 Xuting Liu, Daniel Alexander, Siva Kesava Reddy Kakarla, Behnaz Arzani, Vincent Liu

Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding

To tackle the high inference latency exhibited by autoregressive language models, previous studies have proposed an early-exiting framework that allocates adaptive computation paths for each token based on the complexity of generating the…

Computation and Language · Computer Science 2023-10-10 Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun

Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers,…

Sound · Computer Science 2023-09-14 Zhengyang Chen, Bing Han, Shuai Wang, Yanmin Qian

Encoder-Decoder Diffusion Language Models for Efficient Training and Inference

Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network…

Machine Learning · Computer Science 2025-10-28 Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, Volodymyr Kuleshov