Related papers: Dynamic layer selection in decoder-only transforme…

Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy

Recently, dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computations through elaborate heuristics or additional predictors. However, in the decoding process of…

Computation and Language · Computer Science 2024-04-11 Yijin Liu, Fandong Meng, Jie Zhou

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an…

Computation and Language · Computer Science 2024-10-21 Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu

Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE

Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks; however, their large size makes their inference slow and computationally expensive. Focusing on this problem, we propose to…

Computation and Language · Computer Science 2023-11-08 Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, Chitta Baral

The Diminishing Returns of Early-Exit Decoding in Modern LLMs

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining…

Computation and Language · Computer Science 2026-03-26 Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang

Confident Adaptive Language Modeling

Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use…

Computation and Language · Computer Science 2022-10-26 Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Q. Tran, Yi Tay, Donald Metzler

Accelerating Large Language Model Inference with Self-Supervised Early Exits

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's…

Computation and Language · Computer Science 2026-02-13 Florian Valade

DISC: Dynamic Decomposition Improves LLM Inference Scaling

Inference scaling methods for LLMs often rely on decomposing problems into steps (or groups of tokens), followed by sampling and selecting the best next steps. However, these steps and their sizes are often predetermined or manually…

Machine Learning · Computer Science 2025-10-07 Jonathan Light, Wei Cheng, Benjamin Riviere, Wu Yue, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen

Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components…

Computation and Language · Computer Science 2025-11-27 Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during…

Computation and Language · Computer Science 2024-11-21 Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui

Large Language Model Partitioning for Low-Latency Inference at the Edge

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

Two-dimensional early exit optimisation of LLM inference

We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively…

Computation and Language · Computer Science 2026-04-22 Jan Hůla, David Adamczyk, Tomáš Filip, Martin Pavlíček, Petr Sosík

ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding

Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It combines fast, approximate decoding using a compact version of the model as a draft model…

Machine Learning · Computer Science 2026-04-17 Walaa Amer, Uday das, Fadi Kurdahi

When Neural Code Completion Models Size up the Situation: Attaining Cheaper and Faster Completion through Dynamic Model Inference

Leveraging recent advancements in large language models, modern neural code completion models have demonstrated the capability to generate highly accurate code suggestions. However, their massive size poses challenges in terms of…

Software Engineering · Computer Science 2024-01-19 Zhensu Sun, Xiaoning Du, Fu Song, Shangwen Wang, Li Li

The Remarkable Robustness of LLMs: Stages of Inference?

We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72-95% of their original top-1 prediction accuracy without any…

Machine Learning · Computer Science 2025-06-17 Vedang Lad, Jin Hwa Lee, Wes Gurnee, Max Tegmark

Dynamic Vocabulary Pruning in Early-Exit LLMs

Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of…

Computation and Language · Computer Science 2024-10-31 Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec

A transformer architecture alteration to incentivise externalised reasoning

We propose a new architectural change, and post-training pipeline, for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at…

Artificial Intelligence · Computer Science 2026-03-25 Elizabeth Pavlova, Mariia Koroliuk, Karthik Viswanathan, Cameron Tice, Edward James Young, Puria Radmard

SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference

Autoregressive large language models (LLMs) have made remarkable progress in various natural language generation tasks. However, they incur high computation cost and latency resulting from the autoregressive token-by-token generation. To…

Computation and Language · Computer Science 2023-07-07 Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee

DEL: Context-Aware Dynamic Exit Layer for Efficient Self-Speculative Decoding

Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed…

Computation and Language · Computer Science 2025-08-08 Hossein Entezari Zarch, Lei Gao, Chaoyi Jiang, Murali Annavaram

SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling

Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies,…

Computation and Language · Computer Science 2025-06-05 Anhao Zhao, Fanghua Ye, Yingqi Fan, Junlong Tong, Zhiwei Fei, Hui Su, Xiaoyu Shen

Dr.LLM: Dynamic Layer Routing in LLMs

Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can…

Computation and Language · Computer Science 2026-05-20 Ahmed Heakl, Martin Gubri, Salman Khan, Sangdoo Yun, Seong Joon Oh