Related papers: FOLDER: Accelerating Multi-modal Large Language Mo…

TokenPacker: Efficient Visual Projector for Multimodal LLM

The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation.…

Computer Vision and Pattern Recognition · Computer Science 2024-08-29 Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang

Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference

Multimodal large language models (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens…

Computer Vision and Pattern Recognition · Computer Science 2025-01-28 Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji

Towards General Continuous Memory for Vision-Language Models

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

Efficient Large Multi-modal Models via Visual Context Compression

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

ShortV: Efficient Multimodal Large Language Models by Freezing Visual Tokens in Ineffective Layers

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer…

Computer Vision and Pattern Recognition · Computer Science 2025-11-04 Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

A Survey of Token Compression for Efficient Multimodal Large Language Models

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input.…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-30 Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

From Text to Pixel: Advancing Long-Context Understanding in MLLMs

The rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, the integration of multiple images and extensive textual…

Computer Vision and Pattern Recognition · Computer Science 2024-08-27 Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, William Yang Wang

SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference…

Computation and Language · Computer Science 2025-06-03 Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang

Token Sequence Compression for Efficient Multimodal Computing

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Yasmine Omri, Parth Shroff, Thierry Tambe

Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference

Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, how MLLMs process and utilize visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Hao Yin, Guangzong Si, Zilei Wang

FastMMoE: Accelerating Multimodal Large Language Models through Dynamic Expert Activation and Routing-Aware Token Pruning

Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang

LFTR: Learning-Free Token Reduction for Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. Given that the vision…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Zihui Zhao, Yingxin Li, Yang Li

$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen

Efficient-VLN: A Training-Efficient Vision-Language Navigation Model

Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that…

Computer Vision and Pattern Recognition · Computer Science 2025-12-12 Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

Reducing Peak Memory Usage for Modern Multimodal Large Language Model Pipelines

Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Junwan Kim, Hyunkyung Bae

InfMLLM: A Unified Framework for Visual-Language Tasks

Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models…

Computer Vision and Pattern Recognition · Computer Science 2023-12-07 Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi

PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models

Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. To address…

Computer Vision and Pattern Recognition · Computer Science 2024-12-03 Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li