Related papers

Related papers: Efficient Multimodal Learning from Data-centric Pe…

200 papers

In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and…

Computer Vision and Pattern Recognition · Computer Science 2026-01-23 Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and…

Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the…

Computer Vision and Pattern Recognition · Computer Science 2024-03-26 Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, Jian Tang

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture,…

Multimodal Large Language Models (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and be able to run…

Machine Learning · Computer Science 2024-09-04 Jainaveen Sundaram, Ravi Iyer

We present the TinyLLaVA framework that provides a unified perspective in designing and analyzing the small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language…

Machine Learning · Computer Science 2024-02-23 Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, Lei Huang

Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities, including text, vision, audio, video, and 3D environments. Data plays a pivotal…

Artificial Intelligence · Computer Science 2024-07-19 Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, Wentao Zhang

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science 2026-02-06 Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

In an era defined by the explosive growth of data and rapid technological advancements, Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence (AI) systems. Designed to seamlessly integrate diverse data…

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg, Erhan Bas

Multi-modal Large Language Model (MLLM) refers to a model expanded from a Large Language Model (LLM) that possesses the capability to handle and infer multi-modal data. Current MLLMs typically begin by using LLMs to decompose tasks into…

Computation and Language · Computer Science 2023-09-01 Yongqiang Zhao, Zhenyu Li, Feng Zhang, Xinhai Xu, Donghong Liu

As AI moves beyond text, large language models (LLMs) increasingly power vision, audio, and document understanding; however, their high inference costs hinder real-time, scalable deployment. Conversely, smaller open-source models offer cost…

Computation and Language · Computer Science 2025-11-11 Mayank Saini, Arit Kumar Bishwas

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal…

Computer Vision and Pattern Recognition · Computer Science 2024-06-07 Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion…

Computation and Language · Computer Science 2026-02-03 Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu

Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However,…

Computer Vision and Pattern Recognition · Computer Science 2025-10-03 Ricardo Gonzalez Penuela, Felipe Arias-Russi, Victor Capriles

Multimodal Large Language Models (MLLMs) have attracted much attention for their multifunctionality. However, traditional Transformer architectures incur significant overhead due to their quadratic computational complexity. To address this…

Computer Vision and Pattern Recognition · Computer Science 2024-08-22 Wenjun Huang, Jiakai Pan, Jiahao Tang, Yanyu Ding, Yifei Xing, Yuhe Wang, Zhengzhuo Wang, Jianguo Hu

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated satisfactory performance across various vision-language tasks. Current approaches for vision and language interaction fall into two categories:…

Computer Vision and Pattern Recognition · Computer Science 2025-04-08 Feipeng Ma, Yizhou Zhou, Zheyu Zhang, Shilin Yan, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

The rapid advancement of Large Language Models (LLMs) has improved text understanding and generation but poses challenges in computational resources. This study proposes a curriculum learning-inspired, data-centric training strategy that…

Computation and Language · Computer Science 2024-05-14 Jisu Kim, Juhwan Lee

Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first…

Computer Vision and Pattern Recognition · Computer Science 2025-07-03 Aarti Ghatkesar, Ganesh Venkatesh

Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in various applications such as visual question answering, visual perception, understanding, and…

Computation and Language · Computer Science 2024-09-09 Jian Li, Weiheng Lu, Hao Fei, Meng Luo, Ming Dai, Min Xia, Yizhang Jin, Zhenye Gan, Ding Qi, Chaoyou Fu, Ying Tai, Wankou Yang, Yabiao Wang, Chengjie Wang