Related papers: MAPLE: Modality-Aware Post-training and Learning E…

200 papers

Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

Audio and omni-modal large language models exhibit impressive cross-modal reasoning capabilities. However, applying standard reinforcement learning post-training algorithms to these models exposes a critical structural vulnerability:…

Computation and Language · Computer Science 2026-05-28 Cihan Xiao, Yiwen Shao, Chenxing Li, Xiang He, Zhenwen Liang, Steve Yves, Sanjeev Khudanpur, Liefeng Bo

Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Oscar Mañas, Pau Rodriguez, Saba Ahmadi, Aida Nematzadeh, Yash Goyal, Aishwarya Agrawal

Despite Contrastive Language-Image Pretraining (CLIP)'s remarkable capability to retrieve content across modalities, a substantial modality gap persists in its feature space. Intriguingly, we discover that off-the-shelf MLLMs (Multimodal…

Computer Vision and Pattern Recognition · Computer Science 2026-01-01 Pengfei Zhao, Rongbo Luan, Wei Zhang, Peng Wu, Sifeng He

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings because they are trained under a traditional imitation learning framework. Existing closed-loop supervision…

Multimodal learning seeks to combine data from multiple input sources to enhance the performance of different downstream tasks. In real-world scenarios, performance can degrade substantially if some input modalities are missing. Existing…

Machine Learning · Computer Science 2024-10-10 Niki Nezakati, Md Kaykobad Reza, Ameya Patil, Mashhour Solh, M. Salman Asif

Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in…

Computer Vision and Pattern Recognition · Computer Science 2024-10-14 Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif

Recently, prompt learning has garnered considerable attention for its success in various Vision-Language (VL) tasks. However, existing prompt-based models are primarily focused on studying prompt generation and prompt strategies with…

Artificial Intelligence · Computer Science 2024-09-10 Ruiting Dai, Yuqiao Tan, Lisi Mo, Tao He, Ke Qin, Shuang Liang

In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context…

Artificial Intelligence · Computer Science 2025-05-27 Zihan Chen, Song Wang, Zhen Tan, Jundong Li, Cong Shen

The advent of large language models (LLMs) has sparked significant interest in using natural language for preference learning. However, existing methods often suffer from high computational burdens, taxing human supervision, and lack of…

Machine Learning · Computer Science 2024-12-23 Saaduddin Mahmud, Mason Nakamura, Shlomo Zilberstein

Multimodal learning integrates complementary information from different modalities such as image, text, and audio to improve model performance, but its success relies on large-scale labeled data, which is costly to obtain. Active learning…

Computer Vision and Pattern Recognition · Computer Science 2026-03-27 Yuqiao Zeng, Xu Wang, Tengfei Liang, Yiqing Hao, Yi Jin, Hui Yu

We present a new nonlinear dimensionality reduction method, MAPLE, that enhances UMAP by improving manifold modeling. MAPLE employs a self-supervised learning approach to more efficiently encode low-dimensional manifold geometry. Central to…

Machine Learning · Computer Science 2026-05-15 Zeyang Huang, Takanori Fujiwara, Angelos Chatzimparmpas, Wandrille Duchemin, Andreas Kerren

RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences, demonstrating exceptional and measurable efficacy in instruction following tasks; however, it exhibits insufficient compliance…

Artificial Intelligence · Computer Science 2025-05-20 Ruopei Sun, Jianfeng Cai, Jinhua Zhu, Kangwen Zhao, Dongyun Xue, Wengang Zhou, Li Li, Houqiang Li

From clinical healthcare to daily living, continuous sensor monitoring across multiple modalities has shown great promise for real-world intelligent decision-making but also faces various challenges. In this work, we introduce MAESTRO, a…

Machine Learning · Computer Science 2025-10-01 Payal Mohapatra, Yueyuan Sui, Akash Pandey, Stephen Xia, Qi Zhu

Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks among the noteworthy contenders. Many characteristics of the V-L model have been explored in existing research including the challenge of the…

Computer Vision and Pattern Recognition · Computer Science 2024-01-24 Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu

The application of visual instruction tuning and other post-training techniques has significantly enhanced the capabilities of Large Language Models (LLMs) in visual understanding, enriching Vision-Language Models (VLMs) with more…

Computer Vision and Pattern Recognition · Computer Science 2025-06-11 Mingjie Xu, Andrew Estornell, Hongzheng Yang, Yuzhi Zhao, Zhaowei Zhu, Qi Xuan, Jiaheng Wei

Building reliable speech systems often requires combining multiple modalities, like audio and visual cues. While such multimodal solutions frequently lead to improvements in performance and may even be critical in certain cases, they come…

Sound · Computer Science 2025-01-31 Joanna Hong, Sanjeel Parekh, Honglie Chen, Jacob Donley, Ke Tan, Buye Xu, Anurag Kumar

Multi-Task Learning (MTL) is designed to train multiple correlated tasks simultaneously, thereby enhancing the performance of individual tasks. Typically, a multi-task network structure consists of a shared backbone and task-specific…

Computer Vision and Pattern Recognition · Computer Science 2023-12-15 Yi Xin, Junlong Du, Qiang Wang, Ke Yan, Shouhong Ding

Multimodal foundation models have achieved impressive progress across a wide range of vision-language tasks. However, existing approaches often adopt fixed or task-specific fusion strategies, neglecting the intrinsic variability of modality…

Computer Vision and Pattern Recognition · Computer Science 2025-06-17 Liam Bennett, Mason Clark, Lucas Anderson, Hana Satou, Olivia Martinez

Multimodal fusion is susceptible to modality imbalance, where dominant modalities overshadow weak ones, easily leading to biased learning and suboptimal fusion, especially for incomplete modality conditions. To address this problem, we…

Machine Learning · Computer Science 2026-03-20 Xiang Shi, Rui Zhang, Jiawei Liu, Yinpeng Liu, Qikai Cheng, Wei Lu