Related papers: Improving Multimodal Large Language Models Using C…

200 papers

Recent advancements in Artificial Intelligence have led to the development of Multimodal Large Language Models (MLLMs). However, adapting these pre-trained models to dynamic data distributions and various tasks efficiently remains a…

Machine Learning · Computer Science · 2025-03-05 · Yukang Huo, Hao Tang

Continual learning (CL) has emerged as a pivotal paradigm to enable large language models (LLMs) to dynamically adapt to evolving knowledge and sequential tasks while mitigating catastrophic forgetting, a critical limitation of the static…

Computation and Language · Computer Science · 2026-03-16 · Hongyang Chen, Zhongwu Sun, Hongfei Ye, Kunchi Li, Xuemin Lin

In recent years, multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets, enabling them to generally understand images well. However, the inherent difficulty in explicitly…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-08 · Yuanze Lin, Yunsheng Li, Dongdong Chen, Weijian Xu, Ronald Clark, Philip Torr, Lu Yuan

Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-07 · Davide Caffagni, Federico Cocchi, Luca Barsellotti, Nicholas Moratelli, Sara Sarto, Lorenzo Baraldi, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

In recent years, multimodal large language models (MLLMs) have shown remarkable capabilities in tasks like visual question answering and common sense reasoning, while visual perception models have made significant strides in perception…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-25 · Guanqun Wang, Xinyu Wei, Jiaming Liu, Ray Zhang, Yichi Zhang, Kevin Zhang, Maurice Chong, Shanghang Zhang

Multimodal large language models (MLLMs) have shown success in vision-language tasks, but their ability to reason over complex educational materials remains largely untested. This work presents the first evaluation of state-of-the-art…

Computation and Language · Computer Science · 2025-07-16 · Hessa A. Alawwad, Anas Zafar, Areej Alhothali, Usman Naseem, Ali Alkhathlan, Amani Jamal

Language models (LMs) and their extension, vision-language models (VLMs), have achieved remarkable performance across various tasks. However, they still struggle with complex reasoning tasks that require multimodal or multilingual…

Machine Learning · Computer Science · 2025-07-09 · Wenyi Wu, Zixuan Song, Kun Zhou, Yifei Shao, Zhiting Hu, Biwei Huang

Instruction tuning is now a widely adopted approach to aligning large multimodal models (LMMs) to follow human intent. It unifies the data format of vision-language tasks, enabling multi-task joint training. However, vision-language tasks…

Machine Learning · Computer Science · 2023-11-29 · Jinghan He, Haiyun Guo, Ming Tang, Jinqiao Wang

In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating…

Computer Vision and Pattern Recognition · Computer Science · 2024-06-27 · Qi She, Junwen Pan, Xin Wan, Rui Zhang, Dawei Lu, Kai Huang

The rapid advancement of generative models has empowered modern AI systems to comprehend and produce highly sophisticated content, even achieving human-level performance in specific domains. However, these models are fundamentally…

Achieving deep alignment between vision and language remains a central challenge for Multimodal Large Language Models (MLLMs). These models often fail to fully leverage visual input, defaulting to strong language priors. Our approach first…

Computer Vision and Pattern Recognition · Computer Science · 2025-07-03 · Aarti Ghatkesar, Ganesh Venkatesh

Large language models (LLMs) have revolutionized various domains but still struggle with non-Latin scripts and low-resource languages. This paper addresses the critical challenge of improving multilingual performance without extensive…

Computation and Language · Computer Science · 2025-01-08 · Somnath Kumar, Vaibhav Balloli, Mercy Ranjit, Kabir Ahuja, Sunayana Sitaram, Kalika Bali, Tanuja Ganu, Akshay Nambi

Visual storytelling is an emerging field that combines images and narratives to create engaging and contextually rich stories. Despite its potential, generating coherent and emotionally resonant visual stories remains challenging due to the…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-04 · Xiaochuan Lin, Xiangyong Chen

Continual learning is essential for medical image classification systems to adapt to dynamically evolving clinical environments. The integration of multimodal information can significantly enhance continual learning of image classes.…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-06 · Jiantao Tan, Peixian Ma, Kanghao Chen, Zhiming Dai, Ruixuan Wang

The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant…

Computation and Language · Computer Science · 2024-08-05 · Dongjae Shin, Hyeonseok Lim, Inho Won, Changsu Choi, Minjun Kim, Seungwoo Song, Hangyeol Yoo, Sangmin Kim, Kyungtae Lim

The development of state-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocabulary and pre-training data. Despite the fact that some LLMs have multilingual capabilities, recent…

Computation and Language · Computer Science · 2024-09-27 · Atsuki Yamaguchi, Aline Villavicencio, Nikolaos Aletras

Large Language Models (LLMs) have garnered significant attention due to their remarkable ability to process information across various languages. Despite their capabilities, they exhibit inconsistencies in handling identical queries in…

Computation and Language · Computer Science · 2024-06-24 · Yue Huang, Chenrui Fan, Yuan Li, Siyuan Wu, Tianyi Zhou, Xiangliang Zhang, Lichao Sun

Multimodal models typically combine a powerful large language model (LLM) with a vision encoder and are then trained on multimodal data via instruction tuning. While this process adapts LLMs to multimodal settings, it remains unclear…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-05 · Neale Ratzlaff, Man Luo, Xin Su, Vasudev Lal, Phillip Howard

This paper presents several novel findings on the explainability of vision reflection in large multimodal models (LMMs). First, we show that prompting an LMM to verify the prediction of a specialized vision model can improve recognition…

Computer Vision and Pattern Recognition · Computer Science · 2025-08-12 · Guoyuan An, JaeYoon Kim, SungEui Yoon

Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language and generating coherent responses to various prompts or queries. These models far exceed the complexity of conventional neural…

Machine Learning · Computer Science · 2024-12-05 · Minghao Shao, Abdul Basit, Ramesh Karri, Muhammad Shafique