Related papers: DEED: Dynamic Early Exit on Decoder for Accelerati…

200 papers

Large-scale Transformer models bring significant improvements for various downstream vision language tasks with a unified architecture. The performance improvements come with increasing model size, resulting in slow inference speed and…

Computer Vision and Pattern Recognition · Computer Science · 2023-04-04 · Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu

Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free,…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-27 · Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue

Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed…

Computation and Language · Computer Science · 2025-08-08 · Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

We propose a dynamic encoder transducer (DET) for on-device speech recognition. One DET model scales to multiple devices with different computation capacities without retraining or finetuning. To trade off accuracy and latency, DET…

The inference of large language models imposes significant computational workloads, often requiring the processing of billions of parameters. Although early-exit strategies have proven effective in reducing computational demands by halting…

Computation and Language · Computer Science · 2026-01-08 · Sangmin Yoo, Srikanth Malla, Chiho Choi, Wei D. Lu, Joon Hee Choi

Deploying deep learning models in time-critical applications with limited computational resources, for instance in edge computing systems and IoT networks, is a challenging task that often relies on dynamic inference methods such as early…

Machine Learning · Computer Science · 2022-06-30 · Arian Bakhtiarnia, Qi Zhang, Alexandros Iosifidis

Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications. However, they are also notorious for being slow in inference, which makes them difficult to deploy in real-time applications. We…

Computation and Language · Computer Science · 2020-04-28 · Ji Xin, Raphael Tang, Jaejun Lee, Yaoliang Yu, Jimmy Lin

Recently, Transformer-based encoder-decoder models have demonstrated strong performance in multilingual speech recognition. However, the decoder's autoregressive nature and large size introduce significant bottlenecks during inference.…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-08-28 · Yunkyu Lim, Jihwan Park, Hyung Yong Kim, Hanbin Lee, Byeong-Yeol Kim

In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite the progress, their practical deployment is…

Robotics · Computer Science · 2025-06-17 · Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, Haoang Li

Transformer-based NLP models are powerful but have high computational costs that limit deployment. Finetuned encoder-decoder models are popular in specialized domains and can outperform larger more generalized decoder-only models, such as…

Computation and Language · Computer Science · 2024-11-19 · Bo-Ru Lu, Nikita Haduong, Chien-Yu Lin, Hao Cheng, Noah A. Smith, Mari Ostendorf

Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy in these techniques crucially depend on the criteria used to make exit decisions. We propose…

Machine Learning · Computer Science · 2025-02-04 · Divya Jyoti Bajpai, Manjesh Kumar Hanawal

Deep learning (DL) techniques are increasingly pervasive across various domains, including wireless communication, where they extract insights from raw radio signals. However, the computational demands of DL pose significant challenges,…

Signal Processing · Electrical Eng. & Systems · 2024-09-05 · Dieter Verbruggen, Hazem Sallouha, Sofie Pollin

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining…

Computation and Language · Computer Science · 2026-03-26 · Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang

The recently proposed end-to-end transformer detectors, such as DETR and Deformable DETR, have a cascade structure of stacking 6 decoder layers to update object queries iteratively, without which their performance degrades seriously. In…

Computer Vision and Pattern Recognition · Computer Science · 2021-04-06 · Zhuyu Yao, Jiangbo Ai, Boxun Li, Chi Zhang

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's…

Computation and Language · Computer Science · 2026-02-13 · Florian Valade

Recent work in multilingual translation advances translation quality surpassing bilingual baselines using deep transformer models with increased capacity. However, the extra latency and memory costs introduced by this approach may make it…

Computation and Language · Computer Science · 2022-06-07 · Xiang Kong, Adithya Renduchintala, James Cross, Yuqing Tang, Jiatao Gu, Xian Li

Early-Exit (EE) is a Large Language Model (LLM) architecture that accelerates inference by allowing easier tokens to be generated using only a subset of the model's layers. However, traditional batching frameworks are ill-suited for EE…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-12-18 · Xuting Liu, Daniel Alexander, Siva Kesava Reddy Kakarla, Behnaz Arzani, Vincent Liu

To tackle the high inference latency exhibited by autoregressive language models, previous studies have proposed an early-exiting framework that allocates adaptive computation paths for each token based on the complexity of generating the…

Computation and Language · Computer Science · 2023-10-10 · Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun

Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers,…

Sound · Computer Science · 2023-09-14 · Zhengyang Chen, Bing Han, Shuai Wang, Yanmin Qian

Discrete diffusion models enable parallel token sampling for faster inference than autoregressive approaches. However, prior diffusion models use a decoder-only architecture, which requires sampling algorithms that invoke the full network…

Machine Learning · Computer Science · 2025-10-28 · Marianne Arriola, Yair Schiff, Hao Phung, Aaron Gokaslan, Volodymyr Kuleshov