Related papers: Context-Aware Multimodal Pretraining

Adaptive Cross-Modal Few-Shot Learning

Metric-based meta-learning techniques have successfully been applied to few-shot classification problems. In this paper, we propose to leverage cross-modal information to enhance metric-based few-shot learning methods. Visual and semantic…

Machine Learning · Computer Science 2020-02-19 Chen Xing, Negar Rostamzadeh, Boris N. Oreshkin, Pedro O. Pinheiro

Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

Recent work in vision-and-language pretraining has investigated supervised signals from object detection data to learn better, fine-grained multimodal representations. In this work, we take a step further and explore how we can tap into…

Computation and Language · Computer Science 2023-10-20 Emanuele Bugliarello, Aida Nematzadeh, Lisa Anne Hendricks

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings.…

Computer Vision and Pattern Recognition · Computer Science 2021-04-16 Po-Yao Huang, Mandela Patrick, Junjie Hu, Graham Neubig, Florian Metze, Alexander Hauptmann

Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts

Recent vision-language models are driven by large-scale pretrained models. However, adapting pretrained models on limited data presents challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and…

Computer Vision and Pattern Recognition · Computer Science 2023-09-29 Deniz Engin, Yannis Avrithis

Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models

The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples…

Computer Vision and Pattern Recognition · Computer Science 2024-08-29 Zhiqiu Lin, Samuel Yu, Zhiyi Kuang, Deepak Pathak, Deva Ramanan

Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning

Multimodal few-shot learning is challenging due to the large domain gap between vision and language modalities. Existing methods are trying to communicate visual concepts as prompts to frozen language models, but rely on hand-engineered…

Computer Vision and Pattern Recognition · Computer Science 2023-03-01 Ivona Najdenkoska, Xiantong Zhen, Marcel Worring

How to Adapt Pre-trained Vision-and-Language Models to a Text-only Input?

Current language models have been criticised for learning language from text alone without connection between words and their meaning. Consequently, multimodal training has been proposed as a way for creating models with better language…

Computation and Language · Computer Science 2022-09-20 Lovisa Hagström, Richard Johansson

Cross-Modal Adapter for Vision-Language Retrieval

Vision-language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on retrieval…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Shiji Song, Gao Huang

Few-Shot Adversarial Prompt Learning on Vision-Language Models

The vulnerability of deep neural networks to imperceptible adversarial perturbations has attracted widespread attention. Inspired by the success of vision-language foundation models, previous efforts achieved zero-shot adversarial…

Computer Vision and Pattern Recognition · Computer Science 2024-10-24 Yiwei Zhou, Xiaobo Xia, Zhiwei Lin, Bo Han, Tongliang Liu

Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning

Adapting pre-trained models to open classes is a challenging problem in machine learning. Vision-language models fully explore the knowledge of text modality, demonstrating strong zero-shot recognition performance, which is naturally suited…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Zhengqing Gao, Xiang Ao, Xu-Yao Zhang, Cheng-Lin Liu

Multimodal Few-Shot Learning with Frozen Language Models

When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring…

Computer Vision and Pattern Recognition · Computer Science 2021-07-06 Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, Felix Hill

Less is More: A Closer Look at Semantic-based Few-Shot Learning

Few-shot Learning aims to learn and distinguish new categories with a very limited number of available images, presenting a significant challenge in the realm of deep learning. Recent researchers have sought to leverage the additional…

Computer Vision and Pattern Recognition · Computer Science 2024-03-26 Chunpeng Zhou, Haishuai Wang, Xilu Yuan, Zhi Yu, Jiajun Bu

UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot Vision-Language Tasks

Large-scale joint training of multimodal models, e.g., CLIP, has demonstrated great performance in many vision-language tasks. However, image-text pairs for pre-training are restricted to the intersection of images and texts, limiting…

Computer Vision and Pattern Recognition · Computer Science 2023-06-09 Yanan Sun, Zihan Zhong, Qi Fan, Chi-Keung Tang, Yu-Wing Tai

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on…

Computation and Language · Computer Science 2026-02-27 Chungpa Lee, Jy-yong Sohn, Kangwook Lee

On the Limits of Multi-modal Meta-Learning with Auxiliary Task Modulation Using Conditional Batch Normalization

Few-shot learning aims to learn representations that can tackle novel tasks given a small number of examples. Recent studies show that cross-modal learning can improve representations for few-shot classification. More specifically, language…

Computer Vision and Pattern Recognition · Computer Science 2024-05-31 Jordi Armengol-Estapé, Vincent Michalski, Ramnath Kumar, Pierre-Luc St-Charles, Doina Precup, Samira Ebrahimi Kahou

Few-shot learning through contextual data augmentation

Machine translation (MT) models used in industries with constantly changing topics, such as translation or news agencies, need to adapt to new data to maintain their performance over time. Our aim is to teach a pre-trained MT model to…

Computation and Language · Computer Science 2021-04-01 Farid Arthaud, Rachel Bawden, Alexandra Birch

Expanding Language-Image Pretrained Models for General Video Recognition

Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to…

Computer Vision and Pattern Recognition · Computer Science 2022-08-05 Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling

Efficient Transfer Learning for Video-language Foundation Models

Pre-trained vision-language models provide a robust foundation for efficient transfer learning across various downstream tasks. In the field of video action recognition, mainstream approaches often introduce additional modules to capture…

Computer Vision and Pattern Recognition · Computer Science 2025-03-19 Haoxing Chen, Zizheng Huang, Yan Hong, Yanshuo Wang, Zhongcai Lyu, Zhuoer Xu, Jun Lan, Zhangxuan Gu

A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI

Multi-modal generative AI systems, such as those combining vision and language, rely on contrastive pre-training to learn representations across different modalities. While their practical benefits are widely acknowledged, a rigorous…

Machine Learning · Computer Science 2025-10-22 Kazusato Oko, Licong Lin, Yuhang Cai, Song Mei

Meta-learning For Vision-and-language Cross-lingual Transfer

Current pre-trained vision-language models (PVLMs) achieve excellent performance on a range of multi-modal datasets. Recent work has aimed at building multilingual models, and a range of novel multilingual multi-modal datasets have been…

Computation and Language · Computer Science 2023-10-25 Hanxu Hu, Frank Keller