Related papers: MLLMs-Augmented Visual-Language Representation Lea…

Indexing Multimodal Language Models for Large-scale Image Retrieval

Multimodal Large Language Models (MLLMs) have demonstrated strong cross-modal reasoning capabilities, yet their potential for vision-only tasks remains underexplored. We investigate MLLMs as training-free similarity estimators for…

Computer Vision and Pattern Recognition · Computer Science 2026-04-16 Bahey Tharwat, Giorgos Kordopatis-Zilos, Pavel Suma, Ian Reid, Giorgos Tolias

What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models

Large language models (LLMs) have been effectively used for many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach for zero-shot image classification using multimodal LLMs.…

Computer Vision and Pattern Recognition · Computer Science 2025-06-27 Abdelrahman Abdelhamed, Mahmoud Afifi, Alec Go

Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning

The linguistic capabilities of Multimodal Large Language Models (MLLMs) are critical for their effective application across diverse tasks. This study aims to evaluate the performance of MLLMs on the VALSE benchmark, focusing on the efficacy…

Computation and Language · Computer Science 2024-07-18 Mustafa Dogan, Ilker Kesen, Iacer Calixto, Aykut Erdem, Erkut Erdem

Visual RAG: Expanding MLLM visual knowledge without fine-tuning

Multimodal Large Language Models (MLLMs) have achieved notable performance in computer vision tasks that require reasoning across visual and textual modalities, yet their capabilities are limited to their pre-trained data, requiring…

Computer Vision and Pattern Recognition · Computer Science 2025-01-22 Mirco Bonomo, Simone Bianco

Improving Visual Storytelling with Multimodal Large Language Models

Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the…

Computer Vision and Pattern Recognition · Computer Science 2024-07-04 Xiaochuan Lin, Xiangyong Chen

On the Performance of Multimodal Language Models

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg, Erhan Bas

Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge

In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly…

Computer Vision and Pattern Recognition · Computer Science 2024-07-08 Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip Torr, Lu Yuan

Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models

Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first…

Computer Vision and Pattern Recognition · Computer Science 2025-07-03 Aarti Ghatkesar, Ganesh Venkatesh

Visually-Augmented Language Modeling

Human language is grounded on multimodal knowledge including visual knowledge like colors, sizes, and shapes. However, current large-scale pre-trained language models rely on text-only self-supervised training with massive text data, which…

Computation and Language · Computer Science 2023-02-28 Weizhi Wang, Li Dong, Hao Cheng, Haoyu Song, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

Visual Prompting in Multimodal Large Language Models: A Survey

Multimodal large language models (MLLMs) equip pre-trained large-language models (LLMs) with visual capabilities. While textual prompting in LLMs has been widely studied, visual prompting has emerged for more fine-grained and free-form…

Machine Learning · Computer Science 2024-09-25 Junda Wu, Zhehao Zhang, Yu Xia, Xintong Li, Zhaoyang Xia, Aaron Chang, Tong Yu, Sungchul Kim, Ryan A. Rossi, Ruiyi Zhang, Subrata Mitra, Dimitris N. Metaxas, Lina Yao, Jingbo Shang, Julian McAuley

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception…

Computer Vision and Pattern Recognition · Computer Science 2024-06-25 Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization

This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on…

Artificial Intelligence · Computer Science 2024-06-21 Dongsheng Zhu, Xunzhu Tang, Weidong Han, Jinghui Lu, Yukun Zhao, Guoliang Xing, Junfeng Wang, Dawei Yin

Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey

In the past five years, research has shifted from traditional Machine Learning (ML) and Deep Learning (DL) approaches to leveraging Large Language Models (LLMs), including multimodality, for data augmentation to enhance generalization, and…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Ranjan Sapkota, Shaina Raza, Maged Shoman, Achyut Paudel, Manoj Karkee

Data Efficient Masked Language Modeling for Vision and Language

Masked language modeling (MLM) is one of the key sub-tasks in vision-language pretraining. In the cross-modal setting, tokens in the sentence are masked at random, and the model predicts the masked tokens given the image and the text. In…

Computation and Language · Computer Science 2021-09-07 Yonatan Bitton, Gabriel Stanovsky, Michael Elhadad, Roy Schwartz

Multi-Modal Representation Learning with Text-Driven Soft Masks

We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Jaeyoo Park, Bohyung Han

Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models

Recent advancements in dialogue systems have highlighted the significance of integrating multimodal responses, which enable conveying ideas through diverse modalities rather than solely relying on text-based interactions. This enrichment…

Computation and Language · Computer Science 2024-07-08 Chang-Sheng Kao, Yun-Nung Chen

Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning

Vision-language models (VLMs) allow to embed texts and images in a shared representation space. However, it has been shown that these models are subject to a modality gap phenomenon meaning there exists a clear separation between the…

Computer Vision and Pattern Recognition · Computer Science 2025-05-07 François Role, Sébastien Meyer, Victor Amblard

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture,…

Computer Vision and Pattern Recognition · Computer Science 2024-10-01 Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang

Personalizing Multimodal Large Language Models for Image Captioning: An Experimental Analysis

The task of image captioning demands an algorithm to generate natural language descriptions of visual inputs. Recent advancements have seen a convergence between image captioning research and the development of Large Language Models (LLMs)…

Computer Vision and Pattern Recognition · Computer Science 2024-12-06 Davide Bucciarelli, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Piculet: Specialized Models-Guided Hallucination Decrease for MultiModal Large Language Models

Multimodal Large Language Models (MLLMs) have made significant progress in bridging the gap between visual and language modalities. However, hallucinations in MLLMs, where the generated text does not align with image content, continue to be…

Artificial Intelligence · Computer Science 2024-08-05 Kohou Wang, Xiang Liu, Zhaoxiang Liu, Kai Wang, Shiguo Lian