Related papers: BREEN: Bridge Data-Efficient Encoder-Free Multimod…

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models…

Computer Vision and Pattern Recognition · Computer Science 2024-12-13 Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai

Unveiling Encoder-Free Vision-Language Models

Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting…

Computer Vision and Pattern Recognition · Computer Science 2024-10-30 Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models

Vision encoders serve as the cornerstone of multimodal understanding. Single-encoder architectures like CLIP exhibit inherent constraints in generalizing across diverse multimodal tasks, while recent multi-encoder fusion methods introduce…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Yuchen Liu, Yaoming Wang, Bowen Shi, Xiaopeng Zhang, Wenrui Dai, Chenglin Li, Hongkai Xiong, Qi Tian

Frozen Transformers in Language Models Are Effective Visual Encoder Layers

This paper reveals that large language models (LLMs), despite being trained solely on textual data, are surprisingly strong encoders for purely visual tasks in the absence of language. Even more intriguingly, this can be achieved by a…

Computer Vision and Pattern Recognition · Computer Science 2024-05-07 Ziqi Pang, Ziyang Xie, Yunze Man, Yu-Xiong Wang

AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning

Large language models (LLMs) have enabled the creation of multi-modal LLMs that exhibit strong comprehension of visual data such as images and videos. However, these models usually rely on extensive visual tokens from visual encoders,…

Computer Vision and Pattern Recognition · Computer Science 2025-07-30 Yiwu Zhong, Zhuoming Liu, Yin Li, Liwei Wang

Sequential Modeling Enables Scalable Learning for Large Vision Models

We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images…

Computer Vision and Pattern Recognition · Computer Science 2023-12-04 Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros

Unified Multimodal Understanding via Byte-Pair Visual Encoding

Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

BRAVE: Broadening the visual encoding of vision-language models

Vision-language models (VLMs) are typically composed of a vision encoder, e.g. CLIP, and a language model (LM) that interprets the encoded features to solve downstream tasks. Despite remarkable progress, VLMs are subject to several…

Computer Vision and Pattern Recognition · Computer Science 2024-04-11 Oğuzhan Fatih Kar, Alessio Tonioni, Petra Poklukar, Achin Kulshrestha, Amir Zamir, Federico Tombari

$\mathcal{V}isi\mathcal{P}runer$: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs

Multimodal Large Language Models (MLLMs) have achieved strong performance across vision-language tasks, but suffer from significant computational overhead due to the quadratic growth of attention computations with the number of multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-10-21 Yingqi Fan, Anhao Zhao, Jinlan Fu, Junlong Tong, Hui Su, Yijie Pan, Wei Zhang, Xiaoyu Shen

Growing Visual Generative Capacity for Pre-Trained MLLMs

Multimodal large language models (MLLMs) extend the success of language models to visual understanding, and recent efforts have sought to build unified MLLMs that support both understanding and generation. However, constructing such models…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Hanyu Wang, Jiaming Han, Ziyan Yang, Qi Zhao, Shanchuan Lin, Xiangyu Yue, Abhinav Shrivastava, Zhenheng Yang, Hao Chen

LFTR: Learning-Free Token Reduction for Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) have demonstrated exceptional success in various multimodal tasks, yet their deployment is frequently limited by substantial computational demands and prolonged inference times. Given that the vision…

Computer Vision and Pattern Recognition · Computer Science 2025-10-01 Zihui Zhao, Yingxin Li, Yang Li

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves…

Computer Vision and Pattern Recognition · Computer Science 2025-03-04 Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu

Breaking the Encoder Barrier for Seamless Video-Language Understanding

Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces…

Computer Vision and Pattern Recognition · Computer Science 2025-11-06 Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, Jing Liu

Efficient Multimodal Learning from Data-centric Perspective

Multimodal Large Language Models (MLLMs) have demonstrated notable capabilities in general visual understanding and reasoning tasks. However, their deployment is hindered by substantial computational costs in both training and inference,…

Computer Vision and Pattern Recognition · Computer Science 2024-07-23 Muyang He, Yexin Liu, Boya Wu, Jianhao Yuan, Yueze Wang, Tiejun Huang, Bo Zhao

[CLS] Token Tells Everything Needed for Training-free Efficient MLLMs

Multimodal Large Language Models (MLLMs) have recently demonstrated strong performance across a wide range of vision-language tasks, garnering significant attention in computer vision. However, their efficient deployment remains a…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 Ao Wang, Fengyuan Sun, Hui Chen, Zijia Lin, Jungong Han, Guiguang Ding

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long

QLIP: A Dynamic Quadtree Vision Prior Enhances MLLM Performance Without Retraining

Multimodal Large Language Models (MLLMs) encode images into visual tokens, aligning visual and textual signals within a shared latent space to facilitate crossmodal representation learning. The CLIP model is a widely adopted foundational…

Machine Learning · Computer Science 2026-03-27 Kyle R. Chickering, Bangzheng Li, Muhao Chen

Less is More: A Simple yet Effective Token Reduction Method for Efficient Multi-modal LLMs

The rapid advancement of Multimodal Large Language Models (MLLMs) has led to remarkable performances across various domains. However, this progress is accompanied by a substantial surge in the resource consumption of these models. We…

Computation and Language · Computer Science 2024-12-19 Dingjie Song, Wenjun Wang, Shunian Chen, Xidong Wang, Michael Guan, Benyou Wang

Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs

Vision-language large models have achieved remarkable success in various multi-modal tasks, yet applying them to video understanding remains challenging due to the inherent complexity and computational demands of video data. While…

Computer Vision and Pattern Recognition · Computer Science 2024-10-17 Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, Yunhe Wang

Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning

This paper proposes a novel framework for multi-label image recognition without any training data, called data-free framework, which uses knowledge of pre-trained Large Language Model (LLM) to learn prompts to adapt pretrained…

Computer Vision and Pattern Recognition · Computer Science 2024-03-05 Shuo Yang, Zirui Shang, Yongqi Wang, Derong Deng, Hongwei Chen, Qiyuan Cheng, Xinxiao Wu