Related papers: Instruction Makes a Difference

Visual Instruction Tuning

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first…

Computer Vision and Pattern Recognition · Computer Science 2023-12-14 Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image…

Computer Vision and Pattern Recognition · Computer Science 2023-11-30 Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based…

Computer Vision and Pattern Recognition · Computer Science 2024-02-06 Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun

Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models

Despite the significant success of Large Vision-Language Models (LVLMs), these models still suffer from hallucinations when describing images, generating answers that include non-existent objects. It is reported that these models tend to…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 Bin Li, Dehong Gao, Yeyuan Wang, Linbo Jin, Shanqing Yu, Xiaoyan Cai, Libin Yang

Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models

Multi-modal large language models (MLLMs) are expected to support multi-turn queries of interchanging image and text modalities in production. However, the current MLLMs trained with visual-question-answering (VQA) datasets could suffer…

Computation and Language · Computer Science 2024-11-06 Shengzhi Li, Rongyu Lin, Shichao Pei

Detecting and Preventing Hallucinations in Large Vision Language Models

Instruction tuned Large Vision Language Models (LVLMs) have significantly advanced in generalizing across a diverse set of multi-modal tasks, especially for Visual Question Answering (VQA). However, generating detailed responses that are…

Computer Vision and Pattern Recognition · Computer Science 2024-02-13 Anisha Gunjal, Jihan Yin, Erhan Bas

Instruction-Free Tuning of Large Vision Language Models for Medical Instruction Following

Large vision language models (LVLMs) have demonstrated impressive performance across a wide range of tasks. These capabilities largely stem from visual instruction tuning, which fine-tunes models on datasets consisting of curated…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Myeongkyun Kang, Soopil Kim, Xiaoxiao Li, Sang Hyun Park

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on…

Computer Vision and Pattern Recognition · Computer Science 2024-10-28 Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao

Instruction-Following Evaluation of Large Vision-Language Models

Following the initial flourishing of large language models (LLMs), there has been a surge in proposed large vision-language models (LVLMs) that integrate LLMs with vision capabilities. However, it has been observed that LVLMs, after tuning…

Computation and Language · Computer Science 2025-12-30 Daiki Shiono, Shumpei Miyawaki, Ryota Tanaka, Jun Suzuki

COCO is "ALL" You Need for Visual Instruction Fine-tuning

Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with user's intentions. High-quality and…

Computer Vision and Pattern Recognition · Computer Science 2024-01-18 Xiaotian Han, Yiqi Wang, Bohan Zhai, Quanzeng You, Hongxia Yang

DocVQA: A Dataset for VQA on Document Images

We present a new dataset for Visual Question Answering (VQA) on document images called DocVQA. The dataset consists of 50,000 questions defined on 12,000+ document images. Detailed analysis of the dataset in comparison with similar datasets…

Computer Vision and Pattern Recognition · Computer Science 2021-01-06 Minesh Mathew, Dimosthenis Karatzas, C. V. Jawahar

Generative Visual Instruction Tuning

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new…

Computer Vision and Pattern Recognition · Computer Science 2024-10-04 Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue…

Computer Vision and Pattern Recognition · Computer Science 2024-03-21 Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, Lijuan Wang

HRVDA: High-Resolution Visual Document Assistant

Leveraging vast training data, multimodal large language models (MLLMs) have demonstrated formidable general visual comprehension capabilities and achieved remarkable performance across various tasks. However, their performance in visual…

Computer Vision and Pattern Recognition · Computer Science 2024-04-11 Chaohu Liu, Kun Yin, Haoyu Cao, Xinghua Jiang, Xin Li, Yinsong Liu, Deqiang Jiang, Xing Sun, Linli Xu

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the…

Computer Vision and Pattern Recognition · Computer Science 2024-01-25 Ryota Tanaka, Taichi Iki, Kyosuke Nishida, Kuniko Saito, Jun Suzuki

What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

Visual instruction tuning is crucial for enhancing the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs). In this paper, we aim to investigate a fundamental question: "what makes for good visual…

Computer Vision and Pattern Recognition · Computer Science 2025-02-06 Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan Wang, Mingchen Cai, Ruihua Song, Ji-Rong Wen

LLaVA-Video: Video Instruction Tuning With Synthetic Data

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality…

Computer Vision and Pattern Recognition · Computer Science 2025-08-04 Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion

Large Vision-Language Models (LVLMs) have experienced significant advancements in recent years. However, their performance still falls short in tasks requiring deep visual perception, such as identifying subtle differences between images. A…

Computer Vision and Pattern Recognition · Computer Science 2025-08-07 Qingguo Hu, Ante Wang, Jia Song, Delai Qiu, Qingsong Liu, Jinsong Su

Less is More: High-value Data Selection for Visual Instruction Tuning

Visual instruction tuning is the key to building large vision language models (LVLMs), which can greatly improve the task generalization and solving capabilities by learning a mixture of instruction data from diverse visual tasks. Previous…

Computation and Language · Computer Science 2024-10-11 Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen

Rethinking Overlooked Aspects in Vision-Language Models

Recent advancements in large vision-language models (LVLMs), such as GPT4-V and LLaVA, have been substantial. LLaVA's modular architecture, in particular, offers a blend of simplicity and efficiency. Recent works mainly focus on introducing…

Computer Vision and Pattern Recognition · Computer Science 2024-05-21 Yuan Liu, Le Tian, Xiao Zhou, Jie Zhou