Related papers: Optimizing Vision-Language Interactions Through De…

MOVE: A Mixture-of-Vision-Encoders Approach for Domain-Focused Vision-Language Processing

Multimodal language models (MLMs) integrate visual and textual information by coupling a vision encoder with a large language model through a specific adapter. While existing approaches commonly rely on a single pre-trained vision…

Computer Vision and Pattern Recognition · Computer Science 2025-02-24 Matvey Skripkin, Elizaveta Goncharova, Dmitrii Tarasov, Andrey Kuznetsov

MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations

Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap…

Computer Vision and Pattern Recognition · Computer Science 2025-04-22 Ziyang Zhang, Yang Yu, Yucheng Chen, Xulei Yang, Si Yong Yeo

AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering

The advancement of Multimodal Large Language Models (MLLMs) has driven significant progress in Visual Question Answering (VQA), evolving from Single to Multi Image VQA (MVQA). However, the increased number of images in MVQA inevitably…

Computer Vision and Pattern Recognition · Computer Science 2025-08-26 Kang Zeng, Guojin Zhong, Jintao Cheng, Jin Yuan, Zhiyong Li

MdaIF: Robust One-Stop Multi-Degradation-Aware Image Fusion with Language-Driven Semantics

Infrared and visible image fusion aims to integrate complementary multi-modal information into a single fused result. However, existing methods 1) fail to account for the degradation of visible images under adverse weather conditions, thereby…

Computer Vision and Pattern Recognition · Computer Science 2025-11-18 Jing Li, Yifan Wang, Jiafeng Yan, Renlong Zhang, Bin Yang

LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling

Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still…

Computer Vision and Pattern Recognition · Computer Science 2022-06-16 Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, Lijuan Wang

Unified Multimodal Understanding via Byte-Pair Visual Encoding

Multimodal large language models (MLLMs) have made significant progress in vision-language understanding, yet effectively aligning different modalities remains a fundamental challenge. We present a framework that unifies multimodal…

Computer Vision and Pattern Recognition · Computer Science 2025-07-01 Wanpeng Zhang, Yicheng Feng, Hao Luo, Yijiang Li, Zihao Yue, Sipeng Zheng, Zongqing Lu

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

The development of language models has moved from encoder-decoder to decoder-only designs. In addition, we observe that the two most popular multimodal tasks, the generative and contrastive tasks, are nontrivial to accommodate in one…

Computer Vision and Pattern Recognition · Computer Science 2023-08-10 Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

MoVA: Adapting Mixture of Vision Experts to Multimodal Context

As the key component in multimodal large language models (MLLMs), the ability of the visual encoder greatly affects MLLM's understanding on diverse image content. Although some large-scale pretrained vision encoders such as vision encoders…

Computer Vision and Pattern Recognition · Computer Science 2024-11-01 Zhuofan Zong, Bingqi Ma, Dazhong Shen, Guanglu Song, Hao Shao, Dongzhi Jiang, Hongsheng Li, Yu Liu

EMMA: Efficient Visual Alignment in Multi-Modal LLMs

Multi-modal Large Language Models (MLLMs) have recently exhibited impressive general-purpose capabilities by leveraging vision foundation models to encode the core concepts of images into representations. These are then combined with…

Computer Vision and Pattern Recognition · Computer Science 2025-06-12 Sara Ghazanfari, Alexandre Araujo, Prashanth Krishnamurthy, Siddharth Garg, Farshad Khorrami

MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models

Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet their positional encoding mechanisms remain suboptimal. Existing approaches uniformly assign positional indices to all tokens, overlooking…

Computer Vision and Pattern Recognition · Computer Science 2026-04-15 Ruoxiang Huang, Zhen Yuan

LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation

Despite the impressive advancements of Large Vision-Language Models (LVLMs), existing approaches suffer from a fundamental bottleneck: inefficient visual-language integration. Current methods either disrupt the model's inherent structure or…

Computer Vision and Pattern Recognition · Computer Science 2025-06-23 Tongtian Yue, Longteng Guo, Yepeng Tang, Zijia Zhao, Xinxin Zhu, Hua Huang, Jing Liu

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

Recent advancements in multimodal fusion have witnessed the remarkable success of vision-language (VL) models, which excel in various multimodal applications such as image captioning and visual question answering. However, building VL…

Computer Vision and Pattern Recognition · Computer Science 2024-10-24 Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Yonggang Wen

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

Recently, Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often…

Computer Vision and Pattern Recognition · Computer Science 2025-06-03 Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding

We introduce MUSE-VL, a Unified Vision-Language Model through Semantic discrete Encoding for multimodal understanding and generation. Recently, the research community has begun exploring unified models for visual generation and…

Computer Vision and Pattern Recognition · Computer Science 2025-07-29 Rongchang Xie, Chen Du, Ping Song, Chang Liu

Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified…

Artificial Intelligence · Computer Science 2025-08-14 Zixian Guo, Ming Liu, Qilong Wang, Zhilong Ji, Jinfeng Bai, Lei Zhang, Wangmeng Zuo

Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models

Large Vision-Language Models (LVLMs) have achieved remarkable success in a wide range of multimodal tasks by integrating pre-trained vision encoders and large language models. However, current LVLMs primarily rely on visual features…

Computer Vision and Pattern Recognition · Computer Science 2025-01-20 Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai, Bin Li, Xiangyang Xue

Unified modality separation: A vision-language framework for unsupervised domain adaptation

Unsupervised domain adaptation (UDA) enables models trained on a labeled source domain to handle new unlabeled domains. Recently, pre-trained vision-language models (VLMs) have demonstrated promising zero-shot performance by leveraging…

Computer Vision and Pattern Recognition · Computer Science 2025-08-08 Xinyao Li, Jingjing Li, Zhekai Du, Lei Zhu, Heng Tao Shen

Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts

Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its applicability to extracting generic representations from medical images and texts. Practically, there exist…

Computer Vision and Pattern Recognition · Computer Science 2023-02-20 Zhihong Chen, Shizhe Diao, Benyou Wang, Guanbin Li, Xiang Wan

ViCA: Efficient Multimodal LLMs with Vision-Only Cross-Attention

Modern multimodal large language models (MLLMs) adopt a unified self-attention design that processes visual and textual tokens at every Transformer layer, incurring substantial computational overhead. In this work, we revisit the necessity…

Computer Vision and Pattern Recognition · Computer Science 2026-05-28 Wenjie Liu, Hao Wu, Xin Qiu, Xudong Wang, Yingqi Fan, Yihan Zhang, Anhao Zhao, Yunpu Ma, Xiaoyu Shen

Can We Talk Models Into Seeing the World Differently?

Unlike traditional vision-only models, vision language models (VLMs) offer an intuitive way to access visual content through language prompting by combining a large language model (LLM) with a vision encoder. However, both the LLM and the…

Computer Vision and Pattern Recognition · Computer Science 2025-03-07 Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, M. Jehanzeb Mirza, Margret Keuper, Janis Keuper