Related papers: Large Multimodal Models: Notes on CVPR 2023 Tutori…

200 papers

This tutorial explores recent advancements in multimodal pretrained and large models, capable of integrating and processing diverse data forms such as text, images, audio, and video. Participants will gain an understanding of the…

Computation and Language · Computer Science 2024-10-10 Soyeon Caren Han , Feiqi Cao , Josiah Poon , Roberto Navigli

While the recent advances in Multimodal Large Language Models (MLLMs) constitute a significant leap forward in the field, these models are predominantly confined to the realm of input-side multimodal comprehension, lacking the capacity for…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Zhanyu Wang , Longyue Wang , Zhen Zhao , Minghao Wu , Chenyang Lyu , Huayang Li , Deng Cai , Luping Zhou , Shuming Shi , Zhaopeng Tu

This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools. Advanced proprietary LLMs, such as ChatGPT and GPT-4, have shown great potential for tool usage through sophisticated prompt engineering.…

Computer Vision and Pattern Recognition · Computer Science 2023-05-31 Rui Yang , Lin Song , Yanwei Li , Sijie Zhao , Yixiao Ge , Xiu Li , Ying Shan

Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first…

Computer Vision and Pattern Recognition · Computer Science 2023-12-14 Haotian Liu , Chunyuan Li , Qingyang Wu , Yong Jae Lee

Large language models such as BERT and the GPT series started a paradigm shift that calls for building general-purpose models via pre-training on large datasets, followed by fine-tuning on task-specific datasets. There is now a plethora of…

Computation and Language · Computer Science 2023-06-13 Jeremy Gwinnup , Kevin Duh

We present a vision and language model named MultiModal-GPT to conduct multi-round dialogue with humans. MultiModal-GPT can follow various instructions from humans, such as generating a detailed caption, counting the number of interested…

Computer Vision and Pattern Recognition · Computer Science 2023-06-14 Tao Gong , Chengqi Lyu , Shilong Zhang , Yudong Wang , Miao Zheng , Qian Zhao , Kuikun Liu , Wenwei Zhang , Ping Luo , Kai Chen

With the urgent demand for generalized deep models, many pre-trained big models are proposed, such as BERT, ViT, GPT, etc. Inspired by the success of these models in single domains (like computer vision and natural language processing), the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-11 Xiao Wang , Guangyao Chen , Guangwu Qian , Pengcheng Gao , Xiao-Yong Wei , Yaowei Wang , Yonghong Tian , Wen Gao

The disruptive technology provided by large-scale pre-trained language models (LLMs) such as ChatGPT or GPT-4 has received significant attention in several application domains, often with an emphasis on high-level opportunities and…

Human-Computer Interaction · Computer Science 2023-06-27 Philippe J. Giabbanelli

Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the…

Computer Vision and Pattern Recognition · Computer Science 2024-07-04 Xiaochuan Lin , Xiangyong Chen

Large multimodal models (LMMs) extend large language models (LLMs) with multi-sensory skills, such as visual understanding, to achieve stronger generic intelligence. In this paper, we analyze the latest model, GPT-4V(ision), to deepen the…

Computer Vision and Pattern Recognition · Computer Science 2023-10-12 Zhengyuan Yang , Linjie Li , Kevin Lin , Jianfeng Wang , Chung-Ching Lin , Zicheng Liu , Lijuan Wang

This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed specifically for video understanding. The model is capable of processing both temporal visual and textual data, making it adept at understanding the…

Computer Vision and Pattern Recognition · Computer Science 2024-04-05 Kirolos Ataallah , Xiaoqian Shen , Eslam Abdelrahman , Essam Sleiman , Deyao Zhu , Jian Ding , Mohamed Elhoseiny

Recent advancements in multimodal techniques open exciting possibilities for models excelling in diverse tasks involving text, audio, and image processing. Models like GPT-4V, blending computer vision and language modeling, excel in complex…

Computation and Language · Computer Science 2023-10-20 Xiang Zhang , Senyu Li , Zijun Wu , Ning Shi

Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design…

Computer Vision and Pattern Recognition · Computer Science 2023-08-01 Yan Zeng , Hanbo Zhang , Jiani Zheng , Jiangnan Xia , Guoqiang Wei , Yang Wei , Yuchen Zhang , Tao Kong

This paper presents a comprehensive survey of the taxonomy and evolution of multimodal foundation models that demonstrate vision and vision-language capabilities, focusing on the transition from specialist models to general-purpose…

Computer Vision and Pattern Recognition · Computer Science 2023-09-20 Chunyuan Li , Zhe Gan , Zhengyuan Yang , Jianwei Yang , Linjie Li , Lijuan Wang , Jianfeng Gao

Large language models (LLMs) have undergone significant expansion and have been increasingly integrated across various domains. Notably, in the realm of robot task planning, LLMs harness their advanced reasoning and language comprehension…

The recent GPT-4 has demonstrated extraordinary multi-modal abilities, such as directly generating websites from handwritten text and identifying humorous elements within images. These features are rarely observed in previous…

Computer Vision and Pattern Recognition · Computer Science 2023-10-03 Deyao Zhu , Jun Chen , Xiaoqian Shen , Xiang Li , Mohamed Elhoseiny

Recently, Multimodal Large Language Model (MLLM) represented by GPT-4V has been a new rising research hotspot, which uses powerful Large Language Models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of…

Computer Vision and Pattern Recognition · Computer Science 2024-12-02 Shukang Yin , Chaoyou Fu , Sirui Zhao , Ke Li , Xing Sun , Tong Xu , Enhong Chen

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg , Erhan Bas

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal…

Computer Vision and Pattern Recognition · Computer Science 2024-06-07 Davide Caffagni , Federico Cocchi , Luca Barsellotti , Nicholas Moratelli , Sara Sarto , Lorenzo Baraldi , Lorenzo Baraldi , Marcella Cornia , Rita Cucchiara

In this paper, we critically evaluate the capabilities of the state-of-the-art multimodal large language model, i.e., GPT-4 with Vision (GPT-4V), on Visual Question Answering (VQA) task. Our experiments thoroughly assess GPT-4V's…

Computer Vision and Pattern Recognition · Computer Science 2023-10-31 Zhiling Yan , Kai Zhang , Rong Zhou , Lifang He , Xiang Li , Lichao Sun