Related papers: Efficient Multimodal Learning from Data-centric Pe…

Efficient Multimodal Large Language Models: A Survey

In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and…

Computer Vision and Pattern Recognition · Computer Science · 2026-01-23 · Yizhang Jin, Jian Li, Yexin Liu, Tianjun Gu, Kai Wu, Zhengkai Jiang, Muyang He, Bo Zhao, Xin Tan, Zhenye Gan, Yabiao Wang, Chengjie Wang, Lizhuang Ma

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core bottleneck in making MLLMs more accessible and…

Machine Learning · Computer Science · 2025-09-24 · Tianyu Yu, Zefan Wang, Chongyi Wang, Fuwei Huang, Wenshuo Ma, Zhihui He, Tianchi Cai, Weize Chen, Yuxiang Huang, Yuanqian Zhao, Bokai Xu, Junbo Cui, Yingjing Xu, Liqing Ruan, Luoyuan Zhang, Hanyu Liu, Jingkun Tang, Hongyuan Liu, Qining Guo, Wenhao Hu, Bingxiang He, Jie Zhou, Jie Cai, Ji Qi, Zonghao Guo, Chi Chen, Guoyang Zeng, Yuxuan Li, Ganqu Cui, Ning Ding, Xu Han, Yuan Yao, Zhiyuan Liu, Maosong Sun

Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models

Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the…

Computer Vision and Pattern Recognition · Computer Science · 2024-03-26 · Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, Jian Tang

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture,…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-01 · Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang, Afshin Dehghan, Peter Grasch, Yinfei Yang

LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

Multimodal Large Language Models (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and be able to run…

Machine Learning · Computer Science · 2024-09-04 · Jainaveen Sundaram, Ravi Iyer

TinyLLaVA: A Framework of Small-scale Large Multimodal Models

We present the TinyLLaVA framework that provides a unified perspective in designing and analyzing the small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, language…

Machine Learning · Computer Science · 2024-02-23 · Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, Lei Huang

A Survey of Multimodal Large Language Model from A Data-centric Perspective

Multimodal large language models (MLLMs) enhance the capabilities of standard large language models by integrating and processing data from multiple modalities, including text, vision, audio, video, and 3D environments. Data plays a pivotal…

Artificial Intelligence · Computer Science · 2024-07-19 · Tianyi Bai, Hao Liang, Binwang Wan, Yanran Xu, Xi Li, Shiyu Li, Ling Yang, Bozhou Li, Yifan Wang, Bin Cui, Ping Huang, Jiulong Shan, Conghui He, Binhang Yuan, Wentao Zhang

Magic-MM-Embedding: Towards Visual-Token-Efficient Universal Multimodal Embedding with MLLMs

Multimodal Large Language Models (MLLMs) have shown immense promise in universal multimodal retrieval, which aims to find relevant items of various modalities for a given query. But their practical application is often hindered by the…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-06 · Qi Li, Yanzhe Zhao, Yongxin Zhou, Yameng Wang, Yandong Yang, Yuanjia Zhou, Jue Wang, Zuojian Wang, Jinxiang Liu

A Comprehensive Review of Multimodal Large Language Models: Performance and Challenges Across Different Tasks

In an era defined by the explosive growth of data and rapid technological advancements, Multimodal Large Language Models (MLLMs) stand at the forefront of artificial intelligence (AI) systems. Designed to seamlessly integrate diverse data…

Artificial Intelligence · Computer Science · 2024-08-05 · Jiaqi Wang, Hanqi Jiang, Yiheng Liu, Chong Ma, Xu Zhang, Yi Pan, Mengyuan Liu, Peiran Gu, Sichen Xia, Wenjun Li, Yutong Zhang, Zihao Wu, Zhengliang Liu, Tianyang Zhong, Bao Ge, Tuo Zhang, Ning Qiang, Xintao Hu, Xi Jiang, Xin Zhang, Wei Zhang, Dinggang Shen, Tianming Liu, Shu Zhang

On the Performance of Multimodal Language Models

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science · 2023-11-29 · Utsav Garg, Erhan Bas

Enhancing Subtask Performance of Multi-modal Large Language Model

Multi-modal Large Language Model (MLLM) refers to a model expanded from a Large Language Model (LLM) that possesses the capability to handle and infer multi-modal data. Current MLLMs typically begin by using LLMs to decompose tasks into…

Computation and Language · Computer Science · 2023-09-01 · Yongqiang Zhao, Zhenyu Li, Feng Zhang, Xinhai Xu, Donghong Liu

Towards Resource-Efficient Multimodal Intelligence: Learned Routing among Specialized Expert Models

As AI moves beyond text, large language models (LLMs) increasingly power vision, audio, and document understanding; however, their high inference costs hinder real-time, scalable deployment. Conversely, smaller open-source models offer cost…

Computation and Language · Computer Science · 2025-11-11 · Mayank Saini, Arit Kumar Bishwas

The Revolution of Multimodal Large Language Models: A Survey

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-07 · Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

Large Multimodal Models for Low-Resource Languages: A Survey

In this survey, we systematically analyze techniques used to adapt large multimodal models (LMMs) for low-resource (LR) languages, examining approaches ranging from visual enhancement and data creation to cross-modal transfer and fusion…

Computation and Language · Computer Science · 2026-02-03 · Marian Lupascu, Ana-Cristina Rogoz, Mihai Sorin Stupariu, Radu Tudor Ionescu

Guiding Multimodal Large Language Models with Blind and Low Vision People Visual Questions for Proactive Visual Interpretations

Multimodal large language models (MLLMs) have been integrated into visual interpretation applications to support Blind and Low Vision (BLV) users because of their accuracy and ability to provide rich, human-like interpretations. However,…

Computer Vision and Pattern Recognition · Computer Science · 2025-10-03 · Ricardo Gonzalez Penuela, Felipe Arias-Russi, Victor Capriles

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Multimodal Large Language Models (MLLMs) have attracted much attention for their multifunctionality. However, traditional Transformer architectures incur significant overhead due to their quadratic computational complexity. To address this…

Computer Vision and Pattern Recognition · Computer Science · 2024-08-22 · Wenjun Huang, Jiakai Pan, Jiahao Tang, Yanyu Ding, Yifei Xing, Yuhe Wang, Zhengzhuo Wang, Jianguo Hu

EE-MLLM: A Data-Efficient and Compute-Efficient Multimodal Large Language Model

Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated satisfactory performance across various vision-language tasks. Current approaches for vision and language interaction fall into two categories:…

Computer Vision and Pattern Recognition · Computer Science · 2025-04-08 · Feipeng Ma, Yizhou Zhou, Zheyu Zhang, Shilin Yan, Hebei Li, Zilong He, Siying Wu, Fengyun Rao, Yueyi Zhang, Xiaoyan Sun

Strategic Data Ordering: Enhancing Large Language Model Performance through Curriculum Learning

The rapid advancement of Large Language Models (LLMs) has improved text understanding and generation but poses challenges in computational resources. This study proposes a curriculum learning-inspired, data-centric training strategy that…

Computation and Language · Computer Science · 2024-05-14 · Jisu Kim, Juhwan Lee

Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models

Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-03 · Aarti Ghatkesar, Ganesh Venkatesh

A Survey on Benchmarks of Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in various applications such as visual question answering, visual perception, understanding, and…

Computation and Language · Computer Science · 2024-09-09 · Jian Li, Weiheng Lu, Hao Fei, Meng Luo, Ming Dai, Min Xia, Yizhang Jin, Zhenye Gan, Ding Qi, Chaoyou Fu, Ying Tai, Wankou Yang, Yabiao Wang, Chengjie Wang