Related papers: Parameter Efficient Multimodal Transformers for Vi…

200 papers

Vision-language retrieval is an important multi-modal learning topic, where the goal is to retrieve the most relevant visual candidate for a given text query. Recently, pre-trained models, e.g., CLIP, show great potential on retrieval…

Computer Vision and Pattern Recognition · Computer Science · 2025-09-03 · Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Shiji Song, Gao Huang

Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, facing challenges such as insufficient samples or resource…

Computer Vision and Pattern Recognition · Computer Science · 2024-04-22 · Juncheng Yang, Zuchao Li, Shuai Xie, Weiping Zhu, Wei Yu, Shijun Li

Large-scale vision-language pre-trained models have shown promising transferability to various downstream tasks. As the size of these foundation models and the number of downstream tasks grow, the standard full fine-tuning paradigm becomes…

Computer Vision and Pattern Recognition · Computer Science · 2023-05-23 · Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Wei Zhan, Masayoshi Tomizuka, Mingyu Ding

Multimodal learning seeks to utilize data from multiple sources to improve the overall performance of downstream tasks. It is desirable for redundancies in the data to make multimodal systems robust to missing or corrupted observations in…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-14 · Md Kaykobad Reza, Ashley Prater-Bennette, M. Salman Asif

Recent works on parameter-efficient transfer learning (PETL) show the potential to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters. However, since they usually insert new…

Computer Vision and Pattern Recognition · Computer Science · 2024-07-16 · Taolin Zhang, Jiawang Bai, Zhihe Lu, Dongze Lian, Genping Wang, Xinchao Wang, Shu-Tao Xia

Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged with promising performance. Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-14 · Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li

Parameter-efficient fine-tuning methods have emerged as a promising solution for adapting pre-trained models to various downstream tasks. While these methods perform well in single-task learning, extending them to multi-task learning…

Computer Vision and Pattern Recognition · Computer Science · 2026-04-28 · Neeraj Gangwar, Anshuka Rangi, Rishabh Deshmukh, Holakou Rahmanian, Yesh Dattatreya, Nickvash Kani

Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and…

Computer Vision and Pattern Recognition · Computer Science · 2022-10-18 · Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many…

Computer Vision and Pattern Recognition · Computer Science · 2024-02-06 · Otniel-Bogdan Mercea, Alexey Gritsenko, Cordelia Schmid, Anurag Arnab

In computer vision, adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks has achieved great transfer learning performance. Common approaches for model adaptation either update all model…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-18 · Xuehai He, Chunyuan Li, Pengchuan Zhang, Jianwei Yang, Xin Eric Wang

Recently, the pre-trained Transformer models have received a rising interest in the field of speech processing thanks to their great success in various downstream tasks. However, most fine-tuning approaches update all the parameters of the…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-10-31 · Junyi Peng, Themos Stafylakis, Rongzhi Gu, Oldřich Plchot, Ladislav Mošner, Lukáš Burget, Jan Černocký

Pre-training & fine-tuning is a prevalent paradigm in computer vision (CV). Recently, parameter-efficient transfer learning (PETL) methods have shown promising performance in adapting to downstream tasks with only a few trainable…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-29 · Dongshuo Yin, Xueting Han, Bin Li, Hao Feng, Jing Bai

Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further…

Computer Vision and Pattern Recognition · Computer Science · 2023-02-14 · Javier Selva, Anders S. Johansen, Sergio Escalera, Kamal Nasrollahi, Thomas B. Moeslund, Albert Clapés

Multi-modal learning from video data has seen increased attention recently, as it allows training semantically meaningful embeddings without human annotation, enabling tasks like zero-shot retrieval and classification. In this work, we…

Computer Vision and Pattern Recognition · Computer Science · 2022-08-19 · Nina Shvetsova, Brian Chen, Andrew Rouditchenko, Samuel Thomas, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Hilde Kuehne

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-25 · Karsten Roth, Zeynep Akata, Dima Damen, Ivana Balažević, Olivier J. Hénaff

Multi-modal learning, which focuses on utilizing various modalities to improve the performance of a model, is widely used in video recognition. While traditional multi-modal learning offers excellent recognition results, its computational…

Computer Vision and Pattern Recognition · Computer Science · 2021-05-13 · Rameswar Panda, Chun-Fu Chen, Quanfu Fan, Ximeng Sun, Kate Saenko, Aude Oliva, Rogerio Feris

State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing…

Computation and Language · Computer Science · 2021-06-09 · Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, James Henderson

Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems with an understanding of the real world. Recent advances in video transformers that jointly learn…

Computer Vision and Pattern Recognition · Computer Science · 2023-11-15 · Dota Tianai Dong, Mariya Toneva

In many real-world scenarios, data to train machine learning models becomes available over time. Unfortunately, these models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is…

Computation and Language · Computer Science · 2023-01-16 · Beyza Ermis, Giovanni Zappella, Martin Wistuba, Aditya Rawal, Cedric Archambeau

Fine-tuning of self-supervised models is a powerful transfer learning method in a variety of fields, including speech processing, since it can utilize generic feature representations obtained from large amounts of unlabeled data.…

Multimedia · Computer Science · 2022-12-07 · Shinta Otake, Rei Kawakami, Nakamasa Inoue