Related papers: Dynamic layer selection in decoder-only transforme…

200 papers

Recently, dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computations through elaborate heuristics or additional predictors. However, in the decoding process of…

Computation and Language · Computer Science 2024-04-11 Yijin Liu , Fandong Meng , Jie Zhou

We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an…

Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks; however, their large size makes their inference slow and computationally expensive. Focusing on this problem, we propose to…

Computation and Language · Computer Science 2023-11-08 Neeraj Varshney , Agneet Chatterjee , Mihir Parmar , Chitta Baral

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining…

Computation and Language · Computer Science 2026-03-26 Rui Wei , Rui Du , Hanfei Yu , Devesh Tiwari , Jian Li , Zhaozhuo Xu , Hao Wang

Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use…

Computation and Language · Computer Science 2022-10-26 Tal Schuster , Adam Fisch , Jai Gupta , Mostafa Dehghani , Dara Bahri , Vinh Q. Tran , Yi Tay , Donald Metzler

This paper presents a modular approach to accelerate inference in large language models (LLMs) by adding early exit heads at intermediate transformer layers. Each head is trained in a self-supervised manner to mimic the main model's…

Computation and Language · Computer Science 2026-02-13 Florian Valade

Inference scaling methods for LLMs often rely on decomposing problems into steps (or groups of tokens), followed by sampling and selecting the best next steps. However, these steps and their sizes are often predetermined or manually…

Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components…

Computation and Language · Computer Science 2025-11-27 Siqi Fan , Xuezhi Fang , Xingrun Xing , Peng Han , Shuo Shang , Yequan Wang

One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during…

Computation and Language · Computer Science 2024-11-21 Sean Welleck , Amanda Bertsch , Matthew Finlayson , Hailey Schoelkopf , Alex Xie , Graham Neubig , Ilia Kulikov , Zaid Harchaoui

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis , Ramin Khalili , Iordanis Koutsopoulos

We introduce a two-dimensional (2D) early exit strategy that coordinates layer-wise and sentence-wise exiting for classification tasks in large language models. By processing input incrementally sentence-by-sentence while progressively…

Computation and Language · Computer Science 2026-04-22 Jan Hůla , David Adamczyk , Tomáš Filip , Martin Pavlíček , Petr Sosík

Self-speculative decoding is an inference technique for large language models designed to speed up generation without sacrificing output quality. It combines fast, approximate decoding using a compact version of the model as a draft model…

Machine Learning · Computer Science 2026-04-17 Walaa Amer , Uday das , Fadi Kurdahi

Leveraging recent advancements in large language models, modern neural code completion models have demonstrated the capability to generate highly accurate code suggestions. However, their massive size poses challenges in terms of…

Software Engineering · Computer Science 2024-01-19 Zhensu Sun , Xiaoning Du , Fu Song , Shangwen Wang , Li Li

We investigate the robustness of Large Language Models (LLMs) to structural interventions by deleting and swapping adjacent layers during inference. Surprisingly, models retain 72-95% of their original top-1 prediction accuracy without any…

Machine Learning · Computer Science 2025-06-17 Vedang Lad , Jin Hwa Lee , Wes Gurnee , Max Tegmark

Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of…

Computation and Language · Computer Science 2024-10-31 Jort Vincenti , Karim Abdel Sadek , Joan Velja , Matteo Nulli , Metod Jazbec

We propose a new architectural change, and post-training pipeline, for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We augment an existing transformer architecture with an early-exit mechanism at…

Artificial Intelligence · Computer Science 2026-03-25 Elizabeth Pavlova , Mariia Koroliuk , Karthik Viswanathan , Cameron Tice , Edward James Young , Puria Radmard

Autoregressive large language models (LLMs) have made remarkable progress in various natural language generation tasks. However, they incur high computation cost and latency resulting from the autoregressive token-by-token generation. To…

Computation and Language · Computer Science 2023-07-07 Luciano Del Corro , Allie Del Giorno , Sahaj Agarwal , Bin Yu , Ahmed Awadallah , Subhabrata Mukherjee

Speculative Decoding (SD) is a widely used approach to accelerate the inference of large language models (LLMs) without reducing generation quality. It operates by first using a compact model to draft multiple tokens efficiently, followed…

Computation and Language · Computer Science 2025-08-08 Hossein Entezari Zarch , Lei Gao , Chaoyi Jiang , Murali Annavaram

Large language models (LLMs) achieve remarkable performance across tasks but incur substantial computational costs due to their deep, multi-layered architectures. Layer pruning has emerged as a strategy to alleviate these inefficiencies,…

Computation and Language · Computer Science 2025-06-05 Anhao Zhao , Fanghua Ye , Yingqi Fan , Junlong Tong , Zhiwei Fei , Hui Su , Xiaoyu Shen

Large Language Models (LLMs) process every token through all layers of a transformer stack, causing wasted computation on simple queries and insufficient flexibility for harder ones that need deeper reasoning. Adaptive-depth methods can…

Computation and Language · Computer Science 2026-05-20 Ahmed Heakl , Martin Gubri , Salman Khan , Sangdoo Yun , Seong Joon Oh