Related papers: Visual Instruction Tuning

200 papers

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-04 · Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

Prior work has shown that finetuning large language models (LLMs) using machine-generated instruction-following data enables such models to achieve remarkable zero-shot capabilities on new tasks, and no human-written instructions are…

Computation and Language · Computer Science · 2023-04-07 · Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, Jianfeng Gao

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-30 · Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang

The remarkable multimodal capabilities demonstrated by OpenAI's GPT-4 have sparked significant interest in the development of multimodal Large Language Models (LLMs). A primary research objective of such models is to align visual and…

Computer Vision and Pattern Recognition · Computer Science · 2023-12-29 · Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, Yunchao Wei

Instruction tuning unlocks the superior capability of Large Language Models (LLM) to interact with humans. Furthermore, recent instruction-following datasets include images as visual inputs, collecting responses for image-based…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-06 · Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, Tong Sun

Multimodal large language models are typically trained in two stages: first pre-training on image-text pairs, and then fine-tuning using supervised vision-language instruction data. Recent studies have shown that large language models can…

Machine Learning · Computer Science · 2026-04-14 · Lai Wei, Xiaozhe Li, Zihao Jiang, Weiran Huang, Lichao Sun

Recent advancements in large vision-language models (LVLMs), such as GPT4-V and LLaVA, have been substantial. LLaVA's modular architecture, in particular, offers a blend of simplicity and efficiency. Recent works mainly focus on introducing…

Computer Vision and Pattern Recognition · Computer Science · 2024-05-21 · Yuan Liu, Le Tian, Xiao Zhou, Jie Zhou

This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools. Advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering.…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-31 · Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan

In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs…

Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of…

Computer Vision and Pattern Recognition · Computer Science · 2023-06-02 · Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, Jianfeng Gao

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMM) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMM are performed using models with 13B parameters…

Computer Vision and Pattern Recognition · Computer Science · 2023-09-19 · Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen

Multi-modal large language models have demonstrated impressive performances on most vision-language tasks. However, the model generally lacks the understanding capabilities for specific domain data, particularly when it comes to…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-29 · Yucheng Han, Chi Zhang, Xin Chen, Xu Yang, Zhibin Wang, Gang Yu, Bin Fu, Hanwang Zhang

This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-01 · Jihao Liu, Xin Huang, Jinliang Zheng, Boxiao Liu, Jia Wang, Osamu Yoshie, Yu Liu, Hongsheng Li

Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-28 · Guohao Sun, Can Qin, Huazhu Fu, Linwei Wang, Zhiqiang Tao

The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-06 · Bin Wang, Fan Wu, Xiao Han, Jiahui Peng, Huaping Zhong, Pan Zhang, Xiaoyi Dong, Weijia Li, Wei Li, Jiaqi Wang, Conghui He

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-04 · Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

Visual instruction tuning has become the predominant technology in eliciting the multimodal task-solving capabilities of large vision-language models (LVLMs). Despite the success, as visual instructions require images as the input, it would…

Computation and Language · Computer Science · 2025-02-18 · Zikang Liu, Kun Zhou, Wayne Xin Zhao, Dawei Gao, Yaliang Li, Ji-Rong Wen

Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-04 · Xiaochuan Lin, Xiangyong Chen

Large vision-language models (LVLMs) have shown promise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and…

Computation and Language · Computer Science · 2024-06-18 · Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, Benyou Wang

The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant…

Computation and Language · Computer Science · 2024-08-05 · Dongjae Shin, Hyeonseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim