Related papers: Efficient Token Pruning for LLaDA-V

200 papers

In this work, we present FastAV, the first token pruning framework tailored for audio-visual large language models (AV-LLMs). While token pruning has been actively explored in standard large language models (LLMs) and vision-language models…

Machine Learning · Computer Science · 2026-01-21 · Chaeyoung Jung, Youngjoon Jang, Seungwoo Lee, Joon Son Chung

Token pruning has emerged as an effective approach to reduce the substantial computational overhead of Large Vision-Language Models (LVLMs) by discarding less informative visual tokens while preserving performance. However, existing methods…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-14 · Kexin Ma, Jing Xiao, Chaofeng Chen, Geyong Min, Guibo Zhu, Jinqiao Wang, Liang Liao

Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-03 · Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin

In this work, we introduce LLaDA-V, a purely diffusion-based Multimodal Large Language Model (MLLM) that integrates visual instruction tuning with masked diffusion models, representing a departure from the autoregressive paradigms dominant…

Machine Learning · Computer Science · 2025-06-05 · Zebin You, Shen Nie, Xiaolu Zhang, Jun Hu, Jun Zhou, Zhiwu Lu, Ji-Rong Wen, Chongxuan Li

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal reasoning and generation, yet their high computational demands remain a major challenge. Diffusion Vision-Language Models (DVLMs) are particularly attractive…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-18 · Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

Despite their powerful capabilities, Multimodal Large Language Models (MLLMs) suffer from considerable computational overhead due to their reliance on massive visual tokens. Recent studies have explored token pruning to alleviate this…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-13 · Xin Zou, Di Lu, Yizhou Wang, Yibo Yan, Yuanhuiyi Lyu, Xu Zheng, Linfeng Zhang, Xuming Hu

Multi-modal Large Language Models (MLLMs) have achieved remarkable success by integrating visual and textual modalities. However, they incur significant computational overhead due to the large number of vision tokens processed, limiting…

Computation and Language · Computer Science · 2025-03-11 · Yizheng Sun, Yanze Xin, Hao Li, Jingyuan Sun, Chenghua Lin, Riza Batista-Navarro

Multimodal large language models (MLLMs) enhance their perceptual capabilities by integrating visual and textual information. However, processing the massive number of visual tokens incurs a significant computational cost. Existing analysis…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-31 · Jiedong Zhuang, Lu Lu, Ming Dai, Rui Hu, Jian Chen, Qiang Liu, Haoji Hu

Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-21 · Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen

Discrete diffusion-based multimodal large language models (dMLLMs) have emerged as a promising alternative to autoregressive MLLMs thanks to their advantages in parallel decoding and bidirectional context modeling, but most existing dMLLMs…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-20 · Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts have been made to tackle this issue by pruning visual…

Computer Vision and Pattern Recognition · Computer Science · 2025-05-13 · Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang

Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens for representing visual information. Previous studies have noted that visual tokens tend to receive…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-01 · Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan

Vision-language models (VLMs) have achieved impressive performance on multimodal reasoning tasks such as visual question answering, image captioning and so on, but their inference cost remains a significant challenge due to the large number…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-06 · Weichen Zhang, Zhui Zhu, Ningbo Li, Shilong Tao, Kebin Liu, Yunhao Liu

Vision-language models (VLMs) typically encode substantially more visual tokens than text tokens, resulting in significant token redundancy. Pruning uninformative visual tokens is therefore crucial for improving computational efficiency,…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-19 · Kai Zhao, Wubang Yuan, Yuchen Lin, Liting Ruan, Xiaofeng Lu, Deng-Ping Fan, Ming-Ming Cheng, Dan Zeng

Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in…

Computer Vision and Pattern Recognition · Computer Science · 2025-11-18 · Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang

Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences…

Computer Vision and Pattern Recognition · Computer Science · 2026-05-21 · Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa

Although large vision-language models (LVLMs) leverage rich visual token representations to achieve strong performance on multimodal tasks, these tokens also introduce significant computational overhead during inference. Existing…

Machine Learning · Computer Science · 2025-05-20 · Yichen Guo, Hanze Li, Zonghao Zhang, Jinhao You, Kai Tang, Xiande Huang

Large Vision Language Models (LVLMs) have achieved significant success across multi-modal tasks. However, the computational cost of processing long visual tokens can be prohibitively expensive on resource-limited devices. Previous methods…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-03 · Xubing Ye, Yukang Gan, Yixiao Ge, Xiao-Ping Zhang, Yansong Tang

Robotic manipulation with Vision-Language-Action models requires efficient inference over long-horizon multi-modal context, where attention to dense visual tokens dominates computational cost. Existing methods optimize inference speed by…

Robotics · Computer Science · 2025-09-29 · Xiaohuan Pei, Yuxing Chen, Siyu Xu, Yunke Wang, Yuheng Shi, Chang Xu

Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-16 · Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan