Related papers: Efficient Large Multi-modal Models via Visual Cont…

FCoT-VL: Advancing Text-oriented Large Vision-Language Models with Efficient Visual Token Compression

The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-27 · Jianjian Li, Junquan Fan, Feng Tang, Gang Huang, Shitao Zhu, Songlin Liu, Nian Xie, Wulong Liu, Yong Liao

Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See

By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-03 · Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu

FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression

Recent advances on Multi-modal Large Language Models have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-22 · Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo

Beyond Intermediate States: Explaining Visual Redundancy through Language

Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-27 · Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen

Rethinking Token Reduction for Large Vision-Language Models

Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-24 · Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang

Token Sequence Compression for Efficient Multimodal Computing

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-28 · Yasmine Omri, Parth Shroff, Thierry Tambe

EvoComp: Learning Visual Token Compression for Multimodal Large Language Models via Semantic-Guided Evolutionary Labeling

Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-21 · Jiafei Song, Fengwei Zhou, Jin Qu, Wenjin Jason Li, Tong Wu, Gengjian Xue, Zhikang Zhao, Daomin Wei, Yichao Lu, Bailin Na

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-04 · Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng

LRCP: Low-Rank Compressibility Guided Visual Token Pruning for Efficient LVLMs

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-18 · Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

A Survey of Token Compression for Efficient Multimodal Large Language Models

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input.…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-03 · Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models

Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-10 · Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng

Towards Lossless Ultimate Vision Token Compression for VLMs

Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based…

Computer Vision and Pattern Recognition · Computer Science · 2025-12-11 · Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen

Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

Multimodal large language models (MLLMs) have demonstrated great performance on visual question answering (VQA). When it comes to knowledge-based Visual Question Answering (KB-VQA), MLLMs may lack the specialized domain knowledge needed to…

Computer Vision and Pattern Recognition · Computer Science · 2025-02-04 · Weixi Weng, Jieming Zhu, Xiaojun Meng, Hao Zhang, Rui Zhang, Chun Yuan

Inference Optimal VLMs Need Fewer Visual Tokens and More Parameters

Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks, driven by incorporating image representations into the token inputs of Large Language Models (LLMs). However, their…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-22 · Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter

[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision-language tasks, garnering significant attention in computer vision. However, their efficient deployment remains a…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-10 · Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts,…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-26 · Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen

DUET-VLM: Dual stage Unified Efficient Token reduction for VLM Training and Inference

Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual…

Computer Vision and Pattern Recognition · Computer Science · 2026-03-30 · Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Instruction-related visual token compression demonstrates strong task relevance, which…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-05 · Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos.…

Computer Vision and Pattern Recognition · Computer Science · 2025-03-04 · Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Yansong Tang

Towards Efficient Large Vision-Language Models: A Comprehensive Survey on Inference Strategies

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of…

Machine Learning · Computer Science · 2026-04-14 · Surendra Pathak, Bo Han