Related papers: Saliency-driven Dynamic Token Pruning for Large La…

Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference

Long-context inference enhances the reasoning capability of Large Language Models (LLMs), but incurs significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown great promise in reducing inference…

Computation and Language · Computer Science 2026-02-03 Zimeng Wu, Donghao Wang, Chaozhe Jin, Jiaxin Chen, Yunhong Wang

FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing

Recently, large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws, which significantly increase model size. However, the huge computation overhead during inference hinders the…

Computation and Language · Computer Science 2024-12-17 Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum

Learned Token Pruning for Transformers

Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes…

Computation and Language · Computer Science 2022-06-06 Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer

Vision-Language Transformers (VLTs) have shown great success recently, but are meanwhile accompanied by heavy computation costs, where a major reason can be attributed to the large number of visual and language tokens. Existing token…

Computer Vision and Pattern Recognition · Computer Science 2024-03-06 Jianjian Cao, Peng Ye, Shengze Li, Chong Yu, Yansong Tang, Jiwen Lu, Tao Chen

Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation

Vision transformers have achieved leading performance on various visual tasks yet still suffer from high computational complexity. The situation deteriorates in dense prediction tasks like semantic segmentation, as high-resolution inputs…

Computer Vision and Pattern Recognition · Computer Science 2023-09-29 Quan Tang, Bowen Zhang, Jiajun Liu, Fagui Liu, Yifan Liu

Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images…

Computer Vision and Pattern Recognition · Computer Science 2026-02-17 Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer,…

Computation and Language · Computer Science 2025-11-25 Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the…

Computer Vision and Pattern Recognition · Computer Science 2025-10-24 Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by…

Computer Vision and Pattern Recognition · Computer Science 2025-04-03 Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang

DLP: Dynamic Layerwise Pruning in Large Language Models

Pruning has recently been widely adopted to reduce the parameter scale and improve the inference efficiency of Large Language Models (LLMs). Mainstream pruning techniques often rely on uniform layerwise pruning strategies, which can lead to…

Computation and Language · Computer Science 2025-06-04 Yuli Chen, Bo Cheng, Jiale Han, Yingying Zhang, Yingting Li, Shuhao Zhang

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens…

Computation and Language · Computer Science 2024-06-03 Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann

Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?

Multimodal large language models (MLLMs) have shown remarkable performance for cross-modal understanding and generation, yet still suffer from severe inference costs. Recently, abundant works have been proposed to solve this problem with…

Computation and Language · Computer Science 2025-05-30 Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang

Leveraging KV Similarity for Online Structured Pruning in LLMs

Pruning has emerged as a promising direction for accelerating large language model (LLM) inference, yet existing approaches often suffer from instability because they rely on offline calibration data that may not generalize across inputs.…

Computation and Language · Computer Science 2025-12-09 Jungmin Lee, Gwangeun Byeon, Yulhwa Kim, Seokin Hong

RedVTP: Training-Free Acceleration of Diffusion Vision-Language Models Inference via Masked Token-Guided Visual Token Pruning

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models

Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by…

Computer Vision and Pattern Recognition · Computer Science 2026-01-23 Chenyang Li, Jieyuan Liu, Bin Li, Bo Gao, Yilin Yuan, Yangfan He, Yuchen Li, Jingqun Tang

Deterministic Differentiable Structured Pruning for Large Language Models

Structured pruning reduces LLM inference cost by removing low-importance architectural components. This can be viewed as learning a multiplicative gate for each component under an l0 sparsity constraint. Due to the discreteness of the l0…

Machine Learning · Computer Science 2026-05-12 Weiyu Huang, Pengle Zhang, Xiaolu Zhang, Jun Zhou, Jun Zhu, Jianfei Chen

ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning

Large Vision-Language Models (LVLMs) incur high computational costs due to significant redundancy in their visual tokens. To effectively reduce this cost, researchers have proposed various visual token pruning methods. However, existing…

Computer Vision and Pattern Recognition · Computer Science 2026-01-27 Wen Luo, Peng Chen, Xiaotao Huang, LiQun Huang

Flatter Tokens are More Valuable for Speculative Draft Model Training

Speculative Decoding (SD) is a key technique for accelerating Large Language Model (LLM) inference, but it typically requires training a draft model on a large dataset. We approach this problem from a data-centric perspective, finding that…

Computation and Language · Computer Science 2026-02-19 Jiaming Fan, Daming Cao, Xiangzhong Luo, Jiale Fu, Chonghan Liu, Xu Yang

Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing

Vision language models (VLMs) demonstrate strong capabilities in jointly processing visual and textual data. However, they often incur substantial computational overhead due to redundant visual information, particularly in long-form video…

Machine Learning · Computer Science 2025-04-25 Yudong Liu, Jingwei Sun, Yueqian Lin, Jingyang Zhang, Ming Yin, Qinsi Wang, Jianyi Zhang, Hai Li, Yiran Chen

L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models

Large language models (LLMs) have achieved notable progress. Despite their success, next-token prediction (NTP), the dominant method for LLM training and inference, is constrained in both contextual coverage and inference efficiency due to…

Computation and Language · Computer Science 2025-09-23 Xiaohao Liu, Xiaobo Xia, Weixiang Zhao, Manyi Zhang, Xianzhi Yu, Xiu Su, Shuo Yang, See-Kiong Ng, Tat-Seng Chua