Related papers: Token Sequence Compression for Efficient Multimoda…

A Survey of Token Compression for Efficient Multimodal Large Language Models

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input.…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

Towards Lossless Ultimate Vision Token Compression for VLMs

Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen

UniCompress: Token Compression for Unified Vision-Language Understanding and Generation

Unified models aim to support both understanding and generation by encoding images into discrete tokens and processing them alongside text within a single autoregressive framework. This unified design offers architectural simplicity and…

Computer Vision and Pattern Recognition · Computer Science 2026-03-13 Ziyao Wang, Chen Chen, Jingtao Li, Weiming Zhuang, Jiabo Huang, Ang Li, Lingjuan Lyu

VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) encounter significant computational and memory bottlenecks from the massive number of visual tokens generated by high-resolution images or multi-image inputs. Previous token compression techniques…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Jiaying Zhu, Yurui Zhu, Xin Lu, Wenrui Yan, Dong Li, Kunlin Liu, Xueyang Fu, Zheng-Jun Zha

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Jiafei Song, Fengwei Zhou, Jin Qu, Wenjin Jason Li, Tong Wu, Gengjian Xue, Zhikang Zhao, Daomin Wei, Yichao Lu, Bailin Na

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of…

Machine Learning · Computer Science 2026-04-14 Surendra Pathak, Bo Han

LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models

Large Multimodal Models (LMMs) have shown significant visual reasoning capabilities by connecting a visual encoder and a large language model. LMMs typically take in a fixed and large amount of visual tokens, such as the penultimate layer…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, Yan Yan

TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models

Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin

FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Recent advances on Multi-modal Large Language Models have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in…

Computer Vision and Pattern Recognition · Computer Science 2024-11-22 Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language…

Computer Vision and Pattern Recognition · Computer Science 2024-12-03 Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu

Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, Lizhen Cui

FlashVLM: Text-Guided Visual Token Selection for Large Multimodal Models

Large vision-language models (VLMs) typically process hundreds or thousands of visual tokens per image or video frame, incurring quadratic attention cost and substantial redundancy. Existing token reduction methods often ignore the textual…

Computer Vision and Pattern Recognition · Computer Science 2025-12-24 Kaitong Cai, Jusheng Zhang, Jing Yang, Yijia Fan, Pengtao Xie, Jian Wang, Keze Wang

Efficient Large Multi-modal Models via Visual Context Compression

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen

Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

Compound Tokens: Channel Fusion for Vision-Language Representation Learning

We present an effective method for fusing visual-and-language representations for several question answering tasks including visual question answering and visual entailment. In contrast to prior works that concatenate unimodal…

Computer Vision and Pattern Recognition · Computer Science 2022-12-06 Maxwell Mbabilla Aladago, AJ Piergiovanni

Revisiting MLLM Token Technology through the Lens of Classical Visual Coding

Classical visual coding and Multimodal Large Language Model (MLLM) token technology share the same core objective: maximizing information fidelity while minimizing computational cost. Therefore, this paper reexamines MLLM token technology,…

Computer Vision and Pattern Recognition · Computer Science 2025-08-20 Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian

Compression Tells Intelligence: Visual Coding, Visual Token Technology, and the Unification

"Compression Tells Intelligence" is supported by research in artificial intelligence, particularly concerning (multimodal) large language models (LLMs/MLLMs), where compression efficiency often correlates with improved model performance…

Computer Vision and Pattern Recognition · Computer Science 2026-01-29 Xin Jin, Jinming Liu, Yuntao Wei, Junyan Lin, Zhicheng Wang, Jianguo Huang, Xudong Yang, Yanxiao Liu, Wenjun Zeng

AdaTok: Adaptive Token Compression with Object-Aware Representations for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) have demonstrated substantial value in unified text-image understanding and reasoning, primarily by converting images into sequences of patch-level tokens that align with their architectural…

Computer Vision and Pattern Recognition · Computer Science 2025-11-25 Xinliang Zhang, Lei Zhu, Hangzhou He, Shuang Zeng, Ourui Fu, Jiakui Hu, Zhengjian Yao, Yanye Lu