Related papers: Text-centric Alignment for Multi-Modality Learning

Enhance Modality Robustness in Text-Centric Multimodal Alignment with Adversarial Prompting

Converting different modalities into generalized text, which then serves as input prompts for large language models (LLMs), is a common approach for aligning multimodal models, particularly when pairwise data is limited. Text-centric…

Machine Learning · Computer Science 2024-08-20 Yun-Da Tsai, Ting-Yu Yen, Keng-Te Liao, Shou-De Lin

Enhance the Robustness of Text-Centric Multimodal Alignments

Converting different modalities into general text, serving as input prompts for large language models (LLMs), is a common method to align multimodal models when there is limited pairwise data. This text-centric approach leverages the unique…

Computation and Language · Computer Science 2024-07-09 Ting-Yu Yen, Yun-Da Tsai, Keng-Te Liao, Shou-De Lin

TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models

Although Multi-modal Large Language Models (MM-LLMs) have made exciting strides recently, they still struggle to efficiently model the interactions among multi-modal inputs and the generation in non-textual modalities. In this work, we…

Computation and Language · Computer Science 2024-01-05 Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou

DALR: Dual-level Alignment Learning for Multimodal Sentence Representation Learning

Previous multimodal sentence representation learning methods have achieved impressive performance. However, most approaches focus on aligning images and text at a coarse level, facing two critical challenges: cross-modal misalignment bias…

Computation and Language · Computer Science 2025-07-02 Kang He, Yuzhe Ding, Haining Wang, Fei Li, Chong Teng, Donghong Ji

How to Bridge the Gap between Modalities: Survey on Multimodal Large Language Model

We explore Multimodal Large Language Models (MLLMs), which integrate LLMs like GPT-4 to handle multimodal data, including text, images, audio, and more. MLLMs demonstrate capabilities such as generating image captions and answering…

Computation and Language · Computer Science 2025-01-09 Shezheng Song, Xiaopeng Li, Shasha Li, Shan Zhao, Jie Yu, Jun Ma, Xiaoguang Mao, Weimin Zhang

Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and…

Computer Vision and Pattern Recognition · Computer Science 2025-11-27 Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu

Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes

Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. Specifically,…

Artificial Intelligence · Computer Science 2026-01-09 Guanyu Yao, Qiucheng Wu, Yang Zhang, Zhaowen Wang, Handong Zhao, Shiyu Chang

Some Modalities are More Equal Than Others: Decoding and Architecting Multimodal Integration in MLLMs

Despite remarkable advancements in Multimodal Large Language Models (MLLMs), a fundamental question remains: are MLLMs robust to contradicting modalities? To rigorously study this, we introduce MMA-Bench comprising videos and tasks that…

Computer Vision and Pattern Recognition · Computer Science 2025-12-04 Tianle Chen, Chaitanya Chakka, Arjun Reddy Akula, Xavier Thomas, Deepti Ghadiyaram

UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models

Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to the referent entities in a multimodal knowledge base, such as Wikipedia. Existing methods focus heavily on using complex…

Artificial Intelligence · Computer Science 2024-08-22 Liu Qi, He Yongyi, Lian Defu, Zheng Zhi, Xu Tong, Liu Che, Chen Enhong

NoteLLM-2: Multimodal Large Representation Models for Recommendation

Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks. However, their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains…

Information Retrieval · Computer Science 2025-01-22 Chao Zhang, Haoxin Zhang, Shiwei Wu, Di Wu, Tong Xu, Xiangyu Zhao, Yan Gao, Yao Hu, Enhong Chen

BALM-TSF: Balanced Multimodal Alignment for LLM-Based Time Series Forecasting

Time series forecasting is a long-standing and highly challenging research topic. Recently, driven by the rise of large language models (LLMs), research has increasingly shifted from purely time series methods toward harnessing textual…

Artificial Intelligence · Computer Science 2025-09-03 Shiqiao Zhou, Holger Schöner, Huanbo Lyu, Edouard Fouché, Shuo Wang

Toward Robust Multimodal Learning using Multimodal Foundational Models

Existing multimodal sentiment analysis tasks rely heavily on the assumption that the training and test sets are complete multimodal data, but this assumption can be difficult to hold: the multimodal data are often incomplete in…

Computer Vision and Pattern Recognition · Computer Science 2024-01-26 Xianbing Zhao, Soujanya Poria, Xuejiao Li, Yixin Chen, Buzhou Tang

TAMTRL: Teacher-Aligned Reward Reshaping for Multi-Turn Reinforcement Learning in Long-Context Compression

The rapid progress of large language models (LLMs) has led to remarkable performance gains across a wide range of tasks. However, when handling long documents that exceed the model's context window limit, the entire context cannot be…

Computation and Language · Computer Science 2026-03-24 Li Wang, Yandong Wang, Xin Yu, Kui Zhang, Tianhao Peng, Wenjun Wu

Language Model Mapping in Multimodal Music Learning: A Grand Challenge Proposal

We have seen remarkable success in representation learning and language models (LMs) using deep neural networks. Many studies aim to build the underlying connections among different modalities via the alignment and mappings at the token or…

Sound · Computer Science 2025-03-04 Daniel Chin, Gus Xia

Generalizing Large Language Model Usability Across Resource-Constrained

Large Language Models (LLMs) have achieved remarkable success across a wide range of natural language tasks, and recent efforts have sought to extend their capabilities to multimodal domains and resource-constrained environments. However,…

Machine Learning · Computer Science 2025-05-26 Yun-Da Tsai

CoMMIT: Coordinated Multimodal Instruction Tuning

Instruction tuning in multimodal large language models (MLLMs) generally involves cooperative learning between a backbone LLM and a feature encoder of non-text input modalities. The major challenge is how to efficiently find the synergy…

Machine Learning · Computer Science 2025-09-10 Xintong Li, Junda Wu, Tong Yu, Yu Wang, Xiang Chen, Jiuxiang Gu, Lina Yao, Julian McAuley, Jingbo Shang

HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and…

Computation and Language · Computer Science 2025-03-20 Rui Yang, Lin Song, Yicheng Xiao, Runhui Huang, Yixiao Ge, Ying Shan, Hengshuang Zhao

Cross-Modal Safety Alignment: Is textual unlearning all you need?

Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT)…

Computation and Language · Computer Science 2025-10-15 Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit K. Roy-Chowdhury, Chengyu Song

On the Performance of Multimodal Language Models

Instruction-tuned large language models (LLMs) have demonstrated promising zero-shot generalization capabilities across various downstream tasks. Recent research has introduced multimodal capabilities to LLMs by integrating independently…

Computation and Language · Computer Science 2023-11-29 Utsav Garg, Erhan Bas

Fine-tuning Multimodal Large Language Models for Product Bundling

Recent advances in product bundling have leveraged multimodal information through sophisticated encoders, but remain constrained by limited semantic understanding and a narrow scope of knowledge. Therefore, some attempts employ In-context…

Information Retrieval · Computer Science 2025-02-04 Xiaohao Liu, Jie Wu, Zhulin Tao, Yunshan Ma, Yinwei Wei, Tat-seng Chua