Related papers: BREEN: Bridge Data-Efficient Encoder-Free Multimod…

200 papers

The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting…

Computer Vision and Pattern Recognition · Computer Science 2024-10-30 Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian

This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a…

Computer Vision and Pattern Recognition · Computer Science 2024-05-07 Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-30 Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images…

Computer Vision and Pattern Recognition · Computer Science 2023-12-04 Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros

Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several…

Computer Vision and Pattern Recognition · Computer Science 2024-04-11 Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen

Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, Hao Chen

Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. Given that the vision…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Zihui Zhao, Yingxin Li, Yang Li

The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves…

Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces…

Computer Vision and Pattern Recognition · Computer Science 2025-11-06 Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, Jing Liu

Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference,…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao

Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision-language tasks, garnering significant attention in computer vision. However, their efficient deployment remains a…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long

Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate crossmodal representation learning. The CLIP model is a widely adopted foundational…

Machine Learning · Computer Science 2026-03-27 Kyle R. Chickering, Bangzheng Li, Muhao Chen

The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performance across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We…

Computation and Language · Computer Science 2024-12-19 Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang

Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While…

Computer Vision and Pattern Recognition · Computer Science 2024-10-17 Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, Yunhe Wang

This paper proposes a novel framework for multi-label image recognition without any training data, called the data-free framework, which uses knowledge from a pre-trained Large Language Model (LLM) to learn prompts to adapt pretrained…

Computer Vision and Pattern Recognition · Computer Science 2024-03-05 Shuo Yang, Zirui Shang, Yongqi Wang, Derong Deng, Hongwei Chen, Qiyuan Cheng, Xinxiao Wu