Related papers: Large Multimodal Models: Notes on CVPR 2023 Tutori…

Multimodal Large Language Models and Tunings: Vision, Language, Sensors, Audio, and Beyond

This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the…

Computation and Language · Computer Science 2024-10-10 Soyeon Caren Han, Feiqi Cao, Josiah Poon, Roberto Navigli

GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation

While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Zhanyu Wang, Longyue Wang, Zhen Zhao, Minghao Wu, Chenyang Lyu, Huayang Li, Deng Cai, Luping Zhou, Shuming Shi, Zhaopeng Tu

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools. Advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering.…

Computer Vision and Pattern Recognition · Computer Science 2023-05-31 Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan

Visual Instruction Tuning

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first…

Computer Vision and Pattern Recognition · Computer Science 2023-12-14 Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation

Large language models such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets. There is now a plethora of…

Computation and Language · Computer Science 2023-06-13 Jeremy Gwinnup, Kevin Duh

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans. MultiModal-GPT can follow various instructions from humans, such as generating a detailed caption, counting the number of interested…

Computer Vision and Pattern Recognition · Computer Science 2023-06-14 Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

With the urgent demand for generalized deep models, many pre-trained big models are proposed, such as BERT, ViT, GPT, etc. Inspired by the success of these models in single domains (like computer vision and natural language processing), the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-11 Xiao Wang, Guangyao Chen, Guangwu Qian, Pengcheng Gao, Xiao-Yong Wei, Yaowei Wang, Yonghong Tian, Wen Gao

GPT-Based Models Meet Simulation: How to Efficiently Use Large-Scale Pre-Trained Language Models Across Simulation Tasks

The disruptive technology provided by large-scale pre-trained language models (LLMs) such as ChatGPT or GPT-4 has received significant attention in several application domains, often with an emphasis on high-level opportunities and…

Human-Computer Interaction · Computer Science 2023-06-27 Philippe J. Giabbanelli

Improving Visual Storytelling with Multimodal Large Language Models

Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the…

Computer Vision and Pattern Recognition · Computer Science 2024-07-04 Xiaochuan Lin, Xiangyong Chen

The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the…

Computer Vision and Pattern Recognition · Computer Science 2023-10-12 Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, Lijuan Wang

MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-05 Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny

Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond

Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex…

Computation and Language · Computer Science 2023-10-20 Xiang Zhang, Senyu Li, Zijun Wu, Ning Shi

What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?

Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design…

Computer Vision and Pattern Recognition · Computer Science 2023-08-01 Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong

Multimodal Foundation Models: From Specialists to General-Purpose Assistants

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose…

Computer Vision and Pattern Recognition · Computer Science 2023-09-20 Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao

Large Language Models for Robotics: Opportunities, Challenges, and Perspectives

Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension…

Robotics · Computer Science 2024-01-10 Jiaqi Wang, Zihao Wu, Yiwei Li, Hanqi Jiang, Peng Shu, Enze Shi, Huawen Hu, Chong Ma, Yiheng Liu, Xuhui Wang, Yincheng Yao, Xuan Liu, Huaqin Zhao, Zhengliang Liu, Haixing Dai, Lin Zhao, Bao Ge, Xiang Li, Tianming Liu, Shu Zhang

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous…

Computer Vision and Pattern Recognition · Computer Science 2023-10-03 Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny

A Survey on Multimodal Large Language Models

Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of…

Computer Vision and Pattern Recognition · Computer Science 2024-12-02 Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen

On the Performance of Multimodal Language Models

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg, Erhan Bas

The Revolution of Multimodal Large Language Models: A Survey

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal…

Computer Vision and Pattern Recognition · Computer Science 2024-06-07 Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

Multimodal ChatGPT for Medical Applications: an Experimental Study of GPT-4V

In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model, i.e., GPT-4 with Vision (GPT-4V), on Visual Question Answering (VQA) task. Our experiments thoroughly assess GPT-4V's…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, Lichao Sun