Related papers: FOLDER: Accelerating Multi-modal Large Language Mo…

200 papers

The visual projector serves as an essential bridge between the visual encoder and the Large Language Model (LLM) in a Multimodal LLM (MLLM). Typically, MLLMs adopt a simple MLP to preserve all visual contexts via one-to-one transformation.…

Computer Vision and Pattern Recognition · Computer Science 2024-08-29 Wentong Li, Yuqian Yuan, Jian Liu, Dongqi Tang, Song Wang, Jie Qin, Jianke Zhu, Lei Zhang

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

Multimodal large language models (MLLMs) demand considerable computations for inference due to the extensive parameters and the additional input tokens needed for visual information representation. Herein, we introduce Visual Tokens…

Computer Vision and Pattern Recognition · Computer Science 2025-01-28 Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science 2025-07-09 Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained a largely overlooked area. In this…

Computer Vision and Pattern Recognition · Computer Science 2024-11-19 Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille

Multimodal Large Language Models (MLLMs) suffer from high computational costs due to their massive size and the large number of visual tokens. In this paper, we investigate layer-wise redundancy in MLLMs by introducing a novel metric, Layer…

Computer Vision and Pattern Recognition · Computer Science 2025-11-04 Qianhao Yuan, Qingyu Zhang, Yanjiang Liu, Jiawei Chen, Yaojie Lu, Hongyu Lin, Jia Zheng, Xianpei Han, Le Sun

A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input.…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Kele Shao, Keda Tao, Kejia Zhang, Sicheng Feng, Mu Cai, Yuzhang Shang, Haoxuan You, Can Qin, Yang Sui, Huan Wang

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-30 Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

The rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, the integration of multiple images and extensive textual…

Computer Vision and Pattern Recognition · Computer Science 2024-08-27 Yujie Lu, Xiujun Li, Tsu-Jui Fu, Miguel Eckstein, William Yang Wang

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference…

Computation and Language · Computer Science 2025-06-03 Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Yasmine Omri, Parth Shroff, Thierry Tambe

Multimodal large language models (MLLMs) improve performance on vision-language tasks by integrating visual features from pre-trained vision encoders into large language models (LLMs). However, how MLLMs process and utilize visual…

Computer Vision and Pattern Recognition · Computer Science 2025-03-18 Hao Yin, Guangzong Si, Zilei Wang

Multimodal large language models (MLLMs) have achieved impressive performance, but high-resolution visual inputs result in long sequences of visual tokens and substantial inference latency. Reducing redundant visual tokens is critical to…

Computer Vision and Pattern Recognition · Computer Science 2026-03-23 Guoyang Xia, Yifeng Ding, Fengfa Li, Lei Ren, Wei Chen, Fangxiang Feng, Xiaojie Wang

Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. Given that the vision…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Zihui Zhao, Yingxin Li, Yang Li

Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen

Multimodal large language models (MLLMs) have shown promising potential in Vision-Language Navigation (VLN). However, their practical development is severely hindered by the substantial training overhead. We recognize two key issues that…

Computer Vision and Pattern Recognition · Computer Science 2025-12-12 Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang

Multimodal large language models (MLLMs) have recently demonstrated strong capabilities in understanding and generating responses from diverse visual inputs, including high-resolution images and long video sequences. As these models scale…

Computer Vision and Pattern Recognition · Computer Science 2026-04-21 Junwan Kim, Hyunkyung Bae

Large language models (LLMs) have proven their remarkable versatility in handling a comprehensive range of language-centric applications. To expand LLMs' capabilities to a broader spectrum of modal inputs, multimodal large language models…

Computer Vision and Pattern Recognition · Computer Science 2023-12-07 Qiang Zhou, Zhibin Wang, Wei Chu, Yinghui Xu, Hao Li, Yuan Qi

Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by significant computational and memory demands from processing long contexts in multimodal inputs. To address…

Computer Vision and Pattern Recognition · Computer Science 2024-12-03 Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li