Related papers: Efficient Visual Transformer by Learnable Token Me…

Learned Thresholds Token Merging and Pruning for Vision Transformers

Vision transformers have demonstrated remarkable success in a wide range of computer vision tasks over the last years. However, their high computational costs remain a significant barrier to their practical deployment. In particular, the…

Computer Vision and Pattern Recognition · Computer Science 2023-08-21 Maxim Bonnaerens, Joni Dambre

big.LITTLE Vision Transformer for Efficient Visual Recognition

In this paper, we introduce the big.LITTLE Vision Transformer, an innovative architecture aimed at achieving efficient visual recognition. This dual-transformer system is composed of two distinct blocks: the big performance block,…

Computer Vision and Pattern Recognition · Computer Science 2024-10-15 He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong

Video Token Merging for Long-form Video Understanding

As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result…

Computer Vision and Pattern Recognition · Computer Science 2024-11-01 Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, Xinyu Li

PerceptionGPT: Effectively Fusing Visual Perception into LLM

The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to visual large language models (VLLMs). However, effectively harnessing VLLMs for intricate…

Computer Vision and Pattern Recognition · Computer Science 2023-11-14 Renjie Pi, Lewei Yao, Jiahui Gao, Jipeng Zhang, Tong Zhang

Lossless Token Merging Even Without Fine-Tuning in Vision Transformers

Although Vision Transformers (ViTs) have become the standard architecture in computer vision, their massive sizes lead to significant computational overhead. Token compression techniques have attracted considerable attention to address this…

Computer Vision and Pattern Recognition · Computer Science 2026-04-02 Jaeyeon Lee, Dong-Wan Choi

Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers

Recent token reduction methods for Vision Transformers (ViTs) incorporate token merging, which measures the similarities between token embeddings and combines the most similar pairs. However, their merging policies are directly dependent on…

Computer Vision and Pattern Recognition · Computer Science 2024-12-17 Dong Hoon Lee, Seunghoon Hong

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-30 Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

Fine-tuning Image Transformers using Learnable Memory

In this paper we propose augmenting Vision Transformer models with learnable memory tokens. Our approach allows the model to adapt to new tasks, using few parameters, while optionally preserving its capabilities on previously learned tasks.…

Computer Vision and Pattern Recognition · Computer Science 2022-03-31 Mark Sandler, Andrey Zhmoginov, Max Vladymyrov, Andrew Jackson

Less is More: Pay Less Attention in Vision Transformers

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works…

Computer Vision and Pattern Recognition · Computer Science 2021-12-24 Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng

TokenPacker: Efficient Visual Projector for Multimodal LLM

The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation.…

Computer Vision and Pattern Recognition · Computer Science 2024-08-29 Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang

MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning

In this paper, we propose Mixture of Layer-Wise Tokens (MoLT), a parameter- and memory-efficient adaptation framework for audio-visual learning. The key idea of MoLT is to replace conventional, computationally heavy sequential adaptation at…

Sound · Computer Science 2025-12-02 Kyeongha Rho, Hyeongkeun Lee, Jae Won Cho, Joon Son Chung

LFTR: Learning-Free Token Reduction for Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. Given that the vision…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Zihui Zhao, Yingxin Li, Yang Li

Learning to Merge Tokens in Vision Transformers

Transformers are widely applied to solve natural language understanding and computer vision tasks. While scaling up these architectures leads to improved performance, it often comes at the expense of much higher computational costs. In…

Computer Vision and Pattern Recognition · Computer Science 2022-02-25 Cedric Renggli, André Susano Pinto, Neil Houlsby, Basil Mustafa, Joan Puigcerver, Carlos Riquelme

vid-TLDR: Training Free Token merging for Light-weight Video Transformer

Video Transformers have become the prevalent solution for various video downstream tasks with superior expressive power and flexibility. However, these video transformers suffer from heavy computational costs induced by the massive number…

Computer Vision and Pattern Recognition · Computer Science 2024-04-02 Joonmyung Choi, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, Hyunwoo J. Kim

Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling

The computational expense of redundant vision tokens in Large Vision-Language Models (LVLMs) has led many existing methods to compress them via a vision projector. However, this compression may lose visual information that is crucial for…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Ze Feng, Jiang-jiang Liu, Sen Yang, Lingyu Xiao, Zhibin Quan, Zhenhua Feng, Wankou Yang, Jingdong Wang

Multimodal Token Fusion for Vision Transformers

Many adaptations of transformers have emerged to address the single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers…

Computer Vision and Pattern Recognition · Computer Science 2022-07-18 Yikai Wang, Xinghao Chen, Lele Cao, Wenbing Huang, Fuchun Sun, Yunhe Wang

ClustViT: Clustering-based Token Merging for Semantic Segmentation

Vision Transformers can achieve high accuracy and strong generalization across various contexts, but their practical applicability on real-world robotic systems is limited due to their quadratic attention complexity. Recent works have…

Computer Vision and Pattern Recognition · Computer Science 2026-05-04 Fabio Montello, Ronja Güldenring, Lazaros Nalpantidis

Towards Lossless Ultimate Vision Token Compression for VLMs

Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen