Related papers: Parameter Efficient Multimodal Transformers for Vi…

Cross-Modal Adapter for Vision-Language Retrieval

Vision-language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on retrieval…

Computer Vision and Pattern Recognition · Computer Science 2025-09-03 Haojun Jiang , Jianke Zhang , Rui Huang , Chunjiang Ge , Zanlin Ni , Shiji Song , Gao Huang

Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models

Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource…

Computer Vision and Pattern Recognition · Computer Science 2024-04-22 Juncheng Yang , Zuchao Li , Shuai Xie , Weiping Zhu , Wei Yu , Shijun Li

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes…

Computer Vision and Pattern Recognition · Computer Science 2023-05-23 Haoyu Lu , Yuqi Huo , Guoxing Yang , Zhiwu Lu , Wei Zhan , Masayoshi Tomizuka , Mingyu Ding

Robust Multimodal Learning with Missing Modalities via Parameter-Efficient Adaptation

Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in…

Computer Vision and Pattern Recognition · Computer Science 2024-10-14 Md Kaykobad Reza , Ashley Prater-Bennette , M. Salman Asif

Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach

Recent works on parameter-efficient transfer learning (PETL) show the potential to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters. However, since they usually insert new…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Taolin Zhang , Jiawang Bai , Zhihe Lu , Dongze Lian , Genping Wang , Xinchao Wang , Shu-Tao Xia

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Capitalizing on large pre-trained models for various downstream tasks of interest have recently emerged with promising performance. Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes…

Computer Vision and Pattern Recognition · Computer Science 2022-10-14 Junting Pan , Ziyi Lin , Xiatian Zhu , Jing Shao , Hongsheng Li

Parameter-Efficient Multi-Task Learning via Progressive Task-Specific Adaptation

Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Neeraj Gangwar , Anshuka Rangi , Rishabh Deshmukh , Holakou Rahmanian , Yesh Dattatreya , Nickvash Kani

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and…

Computer Vision and Pattern Recognition · Computer Science 2022-10-18 Shoufa Chen , Chongjian Ge , Zhan Tong , Jiangliu Wang , Yibing Song , Jue Wang , Ping Luo

Time-, Memory- and Parameter-Efficient Visual Adaptation

As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many…

Computer Vision and Pattern Recognition · Computer Science 2024-02-06 Otniel-Bogdan Mercea , Alexey Gritsenko , Cordelia Schmid , Anurag Arnab

Parameter-efficient Model Adaptation for Vision Transformers

In computer vision, it has achieved great transfer learning performance via adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks. Common approaches for model adaptation either update all model…

Computer Vision and Pattern Recognition · Computer Science 2023-07-18 Xuehai He , Chunyuan Li , Pengchuan Zhang , Jianwei Yang , Xin Eric Wang

Parameter-efficient transfer learning of pre-trained Transformer models for speaker verification using adapters

Recently, the pre-trained Transformer models have received a rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-31 Junyi Peng , Themos Stafylakis , Rongzhi Gu , Oldřich Plchot , Ladislav Mošner , Lukáš Burget , Jan Černocký

Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions

Pre-training & fine-tuning is a prevalent paradigm in computer vision (CV). Recently, parameter-efficient transfer learning (PETL) methods have shown promising performance in adapting to downstream tasks with only a few trainable…

Computer Vision and Pattern Recognition · Computer Science 2023-11-29 Dongshuo Yin , Xueting Han , Bin Li , Hao Feng , Jing Bai

Video Transformers: A Survey

Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further…

Computer Vision and Pattern Recognition · Computer Science 2023-02-14 Javier Selva , Anders S. Johansen , Sergio Escalera , Kamal Nasrollahi , Thomas B. Moeslund , Albert Clapés

Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval

Multi-modal learning from video data has seen increased attention recently as it allows to train semantically meaningful embeddings without human annotation enabling tasks like zero-shot retrieval and classification. In this work, we…

Computer Vision and Pattern Recognition · Computer Science 2022-08-19 Nina Shvetsova , Brian Chen , Andrew Rouditchenko , Samuel Thomas , Brian Kingsbury , Rogerio Feris , David Harwath , James Glass , Hilde Kuehne

Context-Aware Multimodal Pretraining

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage…

Computer Vision and Pattern Recognition · Computer Science 2024-11-25 Karsten Roth , Zeynep Akata , Dima Damen , Ivana Balažević , Olivier J. Hénaff

AdaMML: Adaptive Multi-Modal Learning for Efficient Video Recognition

Multi-modal learning, which focuses on utilizing various modalities to improve the performance of a model, is widely used in video recognition. While traditional multi-modal learning offers excellent recognition results, its computational…

Computer Vision and Pattern Recognition · Computer Science 2021-05-13 Rameswar Panda , Chun-Fu Chen , Quanfu Fan , Ximeng Sun , Kate Saenko , Aude Oliva , Rogerio Feris

Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks

State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing…

Computation and Language · Computer Science 2021-06-09 Rabeeh Karimi Mahabadi , Sebastian Ruder , Mostafa Dehghani , James Henderson

Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain

Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems with an understanding of the real world. Recent advances in video transformers that jointly learn…

Computer Vision and Pattern Recognition · Computer Science 2023-11-15 Dota Tianai Dong , Mariya Toneva

Memory Efficient Continual Learning with Transformers

In many real-world scenarios, data to train machine learning models becomes available over time. Unfortunately, these models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is…

Computation and Language · Computer Science 2023-01-16 Beyza Ermis , Giovanni Zappella , Martin Wistuba , Aditya Rawal , Cedric Archambeau

Parameter Efficient Transfer Learning for Various Speech Processing Tasks

Fine-tuning of self-supervised models is a powerful transfer learning method in a variety of fields, including speech processing, since it can utilize generic feature representations obtained from large amounts of unlabeled data.…

Multimedia · Computer Science 2022-12-07 Shinta Otake , Rei Kawakami , Nakamasa Inoue