Related papers: Efficient Large Multi-modal Models via Visual Cont…

200 papers

The rapid success of Vision Large Language Models (VLLMs) often depends on high-resolution images with abundant visual tokens, which hinders training and deployment efficiency. Current training-free visual token compression methods…

Computer Vision and Pattern Recognition · Computer Science 2025-02-27 Jianjian Li, Junquan Fan, Feng Tang, Gang Huang, Shitao Zhu, Songlin Liu, Nian Xie, Wulong Liu, Yong Liao

By treating visual tokens from visual encoders as text tokens, Multimodal Large Language Models (MLLMs) have achieved remarkable progress across diverse visual understanding tasks, leveraging the robust architectures of Large Language…

Computer Vision and Pattern Recognition · Computer Science 2024-12-03 Zeliang Zhang, Phu Pham, Wentian Zhao, Kun Wan, Yu-Jhe Li, Jianing Zhou, Daniel Miranda, Ajinkya Kale, Chenliang Xu

Recent advances in Multi-modal Large Language Models have demonstrated that high-resolution image input is crucial for model capabilities, especially for fine-grained tasks. However, high-resolution images lead to a quadratic increase in…

Computer Vision and Pattern Recognition · Computer Science 2024-11-22 Yuke Zhu, Chi Xie, Shuang Liang, Bo Zheng, Sheng Guo

Multi-modal Large Language Models (MLLMs) often process thousands of visual tokens, which consume a significant portion of the context window and impose a substantial computational burden. Prior work has empirically explored visual token…

Computer Vision and Pattern Recognition · Computer Science 2025-03-27 Dingchen Yang, Bowen Cao, Anran Zhang, Weibo Gu, Winston Hu, Guang Chen

Large Vision-Language Models (LVLMs) excel in visual understanding and reasoning, but the excessive visual tokens lead to high inference costs. Although recent token reduction methods mitigate this issue, they mainly target single-turn…

Computer Vision and Pattern Recognition · Computer Science 2026-03-24 Yi Wang, Haofei Zhang, Qihan Huang, Anda Cao, Gongfan Fang, Wei Wang, Xuan Jin, Jie Song, Mingli Song, Xinchao Wang

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Yasmine Omri, Parth Shroff, Thierry Tambe

Recent Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language understanding tasks, yet their inference efficiency is often hampered by the large number of visual tokens, particularly in…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Jiafei Song, Fengwei Zhou, Jin Qu, Wenjin Jason Li, Tong Wu, Gengjian Xue, Zhikang Zhao, Daomin Wei, Yichao Lu, Bailin Na

The advent of real-time large multimodal models (LMMs) like GPT-4o has sparked considerable interest in efficient LMMs. LMM frameworks typically encode visual inputs into vision tokens (continuous representations) and integrate them and…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Shaolei Zhang, Qingkai Fang, Zhe Yang, Yang Feng

Large vision-language models (LVLMs) achieve strong multimodal understanding, but their inference cost grows rapidly with the number of visual tokens, especially for high-resolution images and long videos. Existing attention-based methods…

Computer Vision and Pattern Recognition · Computer Science 2026-05-18 Hongyu Lu, Feng Zhang, Wenwei Jin, Huanling Hu, Tianjun Shi, Shikai Jiang, Yao Hu, Jiawei Li

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input.…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

Recent methods have made notable progress in accelerating Large Vision-Language Models (LVLMs) by exploiting the inherent redundancy in visual inputs. Most existing approaches, however, focus narrowly on reducing image tokens before or…

Computer Vision and Pattern Recognition · Computer Science 2026-03-10 Lianyu Hu, Liqing Gao, Fanhua Shang, Liang Wan, Wei Feng

Visual language models encounter challenges in computational efficiency and latency, primarily due to the substantial redundancy in the token representations of high-resolution images and videos. Current attention/similarity-based…

Computer Vision and Pattern Recognition · Computer Science 2025-12-11 Dehua Zheng, Mouxiao Huang, Borui Jiang, Hailin Hu, Xinghao Chen

Multimodal large language models (MLLMs) have demonstrated great performance on visual question answering (VQA). When it comes to knowledge-based Visual Question Answering (KB-VQA), MLLMs may lack the specialized domain knowledge needed to…

Computer Vision and Pattern Recognition · Computer Science 2025-02-04 Weixi Weng, Jieming Zhu, Xiaojun Meng, Hao Zhang, Rui Zhang, Chun Yuan

Vision Language Models (VLMs) have demonstrated strong capabilities across various visual understanding and reasoning tasks, driven by incorporating image representations into the token inputs of Large Language Models (LLMs). However, their…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Kevin Y. Li, Sachin Goyal, Joao D. Semedo, J. Zico Kolter

Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision-language tasks, garnering significant attention in computer vision. However, their efficient deployment remains a…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts,…

Computer Vision and Pattern Recognition · Computer Science 2026-02-26 Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen

Vision-language models (VLMs) have achieved remarkable multimodal understanding and reasoning capabilities, yet remain computationally expensive due to dense visual tokenization. Existing efficiency approaches either merge redundant visual…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Aditya Kumar Singh, Hitesh Kandala, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

Existing Multimodal Large Language Models (MLLMs) process a large number of visual tokens, leading to significant computational costs and inefficiency. Instruction-related visual token compression demonstrates strong task relevance, which…

Computer Vision and Pattern Recognition · Computer Science 2026-05-05 Lei Lei, Jie Gu, Xiaokang Ma, Chu Tang, Jingmin Chen, Tong Xu

Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos.…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Yansong Tang

Although Large Vision Language Models (LVLMs) have demonstrated impressive multimodal reasoning capabilities, their scalability and deployment are constrained by massive computational requirements. In particular, the massive amount of…

Machine Learning · Computer Science 2026-04-14 Surendra Pathak, Bo Han