
Large Multimodal Models: Notes on CVPR 2023 Tutorial

Computer Vision and Pattern Recognition 2023-06-27 v1

Abstract

This tutorial note summarizes the presentation "Large Multimodal Models: Towards Building and Surpassing Multimodal GPT-4", part of the CVPR 2023 tutorial "Recent Advances in Vision Foundation Models". The tutorial consists of three parts. We first introduce the background on recent GPT-like large models for vision-and-language modeling to motivate research on instruction-tuned large multimodal models (LMMs). As a prerequisite, we then describe the basics of instruction tuning in large language models, which is further extended to the multimodal space. Lastly, we illustrate how to build a minimal prototype of multimodal GPT-4-like models with open-source resources, and review recently emerged topics.

Cite

@article{arxiv.2306.14895,
  title  = {Large Multimodal Models: Notes on CVPR 2023 Tutorial},
  author = {Chunyuan Li},
  journal= {arXiv preprint arXiv:2306.14895},
  year   = {2023}
}

Comments

27 pages, 24 figures; Tutorial website: https://vlp-tutorial.github.io/
