Related papers: Visual Instruction Tuning

Generative Visual Instruction Tuning

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-04 · Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

Instruction Tuning with GPT-4

Prior work has shown that finetuning large language models (LLMs) using machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, and no human-written instructions are…

Computation and Language · Computer Science · 2023-04-07 · Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-30 · Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-29 · Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-06 · Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun

MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning using supervised vision-language instruction data. Recent studies have shown that large language models can…

Machine Learning · Computer Science · 2026-04-14 · Lai Wei, Xiaozhe Li, Zihao Jiang, Weiran Huang, Lichao Sun

Rethinking Overlooked Aspects in Vision-Language Models

Recent advancements in large vision-language models (LVLMs), such as GPT4-V and LLaVA, have been substantial. LLaVA's modular architecture, in particular, offers a blend of simplicity and efficiency. Recent works mainly focus on introducing…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-21 · Yuan Liu, Le Tian, Xiao Zhou, Jie Zhou

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools. Advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering.…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-31 · Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan

LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning

In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs…

Robotics · Computer Science · 2024-06-18 · Dantong Niu, Yuvan Sharma, Giscard Biamby, Jerome Quenum, Yutong Bai, Baifeng Shi, Trevor Darrell, Roei Herzig

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-02 · Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao

An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMM) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMM are performed using models with 13B parameters…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-19 · Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen

ChartLlama: A Multimodal LLM for Chart Understanding and Generation

Multi-modal large language models have demonstrated impressive performances on most vision-language tasks. However, the model generally lacks the understanding capabilities for specific domain data, particularly when it comes to…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-29 · Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang

MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment

This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-01 · Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, Hongsheng Li

STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering

Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-28 · Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao

VIGC: Visual Instruction Generation and Correction

The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-06 · Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He

LLaVA-Video: Video Instruction Tuning With Synthetic Data

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-04 · Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

Do we Really Need Visual Instructions? Towards Visual Instruction-Free Fine-tuning for Large Vision-Language Models

Visual instruction tuning has become the predominant technology in eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite the success, as visual instructions require images as the input, it would…

Computation and Language · Computer Science · 2025-02-18 · Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen

Improving Visual Storytelling with Multimodal Large Language Models

Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-04 · Xiaochuan Lin, Xiangyong Chen

ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models

Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and…

Computation and Language · Computer Science · 2024-06-18 · Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang

X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment

The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant…

Computation and Language · Computer Science · 2024-08-05 · Dongjae Shin, Hyeonseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim