Related papers: Learned Token Pruning for Transformers

Learned Threshold Pruning

This paper presents a novel differentiable method for unstructured weight pruning of deep neural networks. Our learned-threshold pruning (LTP) method learns per-layer thresholds via gradient descent, unlike conventional methods where they…

Machine Learning · Computer Science · 2021-03-22 · Kambiz Azarian, Yash Bhalgat, Jinwon Lee, Tijmen Blankevoort

Learned Thresholds Token Merging and Pruning for Vision Transformers

Vision transformers have demonstrated remarkable success in a wide range of computer vision tasks over the last years. However, their high computational costs remain a significant barrier to their practical deployment. In particular, the…

Computer Vision and Pattern Recognition · Computer Science · 2023-08-21 · Maxim Bonnaerens, Joni Dambre

FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing

Recently, large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws, which significantly increase model size. However, the huge computation overhead during inference hinders the…

Computation and Language · Computer Science · 2024-12-17 · Zekai Li, Jintu Zheng, Ji Liu, Han Liu, Haowei Zhu, Zeping Li, Fuwei Yang, Haiduo Huang, Jinzhang Peng, Dong Li, Lu Tian, Emad Barsoum

Constraint-aware and Ranking-distilled Token Pruning for Efficient Transformer Inference

Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length. In this work, we propose a…

Computation and Language · Computer Science · 2023-06-27 · Junyan Li, Li Lyna Zhang, Jiahang Xu, Yujing Wang, Shaoguang Yan, Yunqing Xia, Yuqing Yang, Ting Cao, Hao Sun, Weiwei Deng, Qi Zhang, Mao Yang

Pyramid Token Pruning for High-Resolution Large Vision-Language Models via Region, Token, and Instruction-Guided Importance

Large Vision-Language Models (LVLMs) have recently demonstrated strong multimodal understanding, yet their fine-grained visual perception is often constrained by low input resolutions. A common remedy is to partition high-resolution images…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-17 · Yuxuan Liang, Xu Li, Xiaolei Chen, Yi Zheng, Haotian Chen, Bin Li, Xiangyang Xue

Saliency-driven Dynamic Token Pruning for Large Language Models

Despite the recent success of large language models (LLMs), LLMs are particularly challenging in long-sequence inference scenarios due to the quadratic computational complexity of the attention mechanism. Inspired by the interpretability…

Computation and Language · Computer Science · 2025-04-10 · Yao Tao, Yehui Tang, Yun Wang, Mingjian Zhu, Hailin Hu, Yunhe Wang

Balanced Token Pruning: Accelerating Vision Language Models Beyond Local Optimization

Large Vision-Language Models (LVLMs) have shown impressive performance across multi-modal tasks by encoding images into thousands of tokens. However, the large number of image tokens results in significant computational overhead, and the…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-24 · Kaiyuan Li, Xiaoyue Chen, Chen Gao, Yong Li, Xinlei Chen

ATP-LLaVA: Adaptive Token Pruning for Large Vision Language Models

Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual tokens can be prohibitively expensive on resource-limited devices. Previous methods…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-03 · Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang

A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention

In modern large language models (LLMs), increasing the context length is crucial for improving comprehension and coherence in long-context, multi-modal, and retrieval-augmented language generation. While many recent transformer models…

Computation and Language · Computer Science · 2025-01-24 · Heejun Lee, Geon Park, Youngwan Lee, Jaduk Suh, Jina Kim, Wonyoung Jeong, Bumsik Kim, Hyemin Lee, Myeongjae Jeon, Sung Ju Hwang

Random-LTD: Random and Layerwise Token Dropping Brings Efficient Training for Large-scale Transformers

Large-scale transformer models have become the de-facto architectures for various machine learning applications, e.g., CV and NLP. However, those large models also introduce prohibitive training costs. To mitigate this issue, we propose a…

Computation and Language · Computer Science · 2022-11-22 · Zhewei Yao, Xiaoxia Wu, Conglong Li, Connor Holmes, Minjia Zhang, Cheng Li, Yuxiong He

IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

Large Vision Language Models show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-02 · Dong-Jae Lee, Sunghyun Baek, Junmo Kim

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers

Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most of LLMs still adopt attention layers between all pairs of tokens…

Computation and Language · Computer Science · 2024-06-03 · Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann

SparseCoder: Advancing Source Code Analysis with Sparse Attention and Learned Token Pruning

As software projects rapidly evolve, software artifacts become more complex and defects behind get harder to identify. The emerging Transformer-based approaches, though achieving remarkable performance, struggle with long code sequences due…

Software Engineering · Computer Science · 2024-09-13 · Xueqi Yang, Mariusz Jakubowski, Li Kang, Haojie Yu, Tim Menzies

TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-18 · Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang

CipherPrune: Efficient and Scalable Private Transformer Inference

Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning; however, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token…

Machine Learning · Computer Science · 2025-03-07 · Yancheng Zhang, Jiaqi Xue, Mengxin Zheng, Mimi Xie, Mingzhe Zhang, Lei Jiang, Qian Lou

Adaptive Layer Selection for Layer-Wise Token Pruning in LLM Inference

Due to the prevalence of large language models (LLMs), key-value (KV) cache reduction for LLM inference has received remarkable attention. Among numerous works that have been proposed in recent years, layer-wise token pruning approaches,…

Computation and Language · Computer Science · 2026-04-17 · Rei Taniguchi, Yuyang Dong, Makoto Onizuka, Chuan Xiao

VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-02 · Hanning Chen, Yang Ni, Wenjun Huang, Yezi Liu, SungHeon Jeong, Fei Wen, Nathaniel Bastian, Hugo Latapie, Mohsen Imani

The Better You Learn, The Smarter You Prune: Towards Efficient Vision-language-action Models via Differentiable Token Pruning

We present LightVLA, a simple yet effective differentiable token pruning framework for vision-language-action (VLA) models. While VLA models have shown impressive capability in executing real-world robotic tasks, their deployment on…

Robotics · Computer Science · 2025-09-23 · Titong Jiang, Xuefeng Jiang, Yuan Ma, Xin Wen, Bailin Li, Kun Zhan, Peng Jia, Yahui Liu, Sheng Sun, Xianpeng Lang

LEAP: Learnable Pruning for Transformer-based Models

Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models. However, current pruning algorithms either only focus on one pruning category, e.g., structured…

Computation and Language · Computer Science · 2022-05-24 · Zhewei Yao, Xiaoxia Wu, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney, Yuxiong He

Large Language Models Are Overparameterized Text Encoders

Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that…

Computation and Language · Computer Science · 2024-10-21 · Thennal D K, Tim Fischer, Chris Biemann