Related papers: Exploring Efficient-Tuned Learning Audio Represent…

200 papers

Multimodal large models have been recognized for their advantages in various performance and downstream tasks. The development of these models is crucial towards achieving general artificial intelligence in the future. In this paper, we…

Sound · Computer Science 2023-09-12 Sen Fang, Bowen Gao, Yangjian Wu, Teik Toe Teoh

We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and…

Sound · Computer Science 2022-02-16 Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello

Lately, researchers in artificial intelligence have been really interested in how language and vision come together, giving rise to the development of multimodal models that aim to seamlessly integrate textual and visual information.…

Computer Vision and Pattern Recognition · Computer Science 2024-10-29 Rajat Chawla, Arkajit Datta, Tushar Verma, Adarsh Jha, Anmol Gautam, Ayush Vatsal, Sukrit Chaterjee, Mukunda NS, Ishaan Bhola

Decoding human visual neural representations is a challenging task with great scientific significance in revealing vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods are difficult to…

Computer Vision and Pattern Recognition · Computer Science 2023-03-31 Changde Du, Kaicheng Fu, Jinpeng Li, Huiguang He

With the advance in self-supervised learning for audio and visual modalities, it has become possible to learn a robust audio-visual speech representation. This would be beneficial for improving the audio-visual speech recognition (AVSR)…

Image and Video Processing · Electrical Eng. & Systems 2022-07-12 Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai

In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and acoustic images, a novel audio data modality. Former models learn audio representations from raw signals or…

Computer Vision and Pattern Recognition · Computer Science 2020-02-12 Andrés F. Pérez, Valentina Sanguineti, Pietro Morerio, Vittorio Murino

Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be implemented…

Machine Learning · Computer Science 2026-01-13 Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso

Self-supervised speech pre-training methods have developed rapidly in recent years and have proven very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing is suffering from the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-09 Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified &…

Computer Vision and Pattern Recognition · Computer Science 2026-02-24 Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang

We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and…

Computer Vision and Pattern Recognition · Computer Science 2023-07-18 Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer

Joint image-text embedding extracted from medical images and associated contextual reports is the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval,…

Computer Vision and Pattern Recognition · Computer Science 2020-09-04 Yikuan Li, Hanyin Wang, Yuan Luo

One of the many tasks facing the typically-developing child language learner is learning to discriminate between the distinctive sounds that make up words in their native language. Here we investigate whether multimodal…

Computation and Language · Computer Science 2024-07-24 Sophia Zhi, Roger P. Levy, Stephan C. Meylan

Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised…

Computation and Language · Computer Science 2023-03-17 Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic

Recently a number of studies demonstrated impressive performance on diverse vision-language multi-modal tasks such as image captioning and visual question answering by extending the BERT architecture with multi-modal pre-training…

Computer Vision and Pattern Recognition · Computer Science 2022-09-22 Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Young-Hak Kim, Edward Choi

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through…

Artificial Intelligence · Computer Science 2026-04-06 Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha

How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap…

Computer Vision and Pattern Recognition · Computer Science 2024-12-10 Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Tae-Hyun Oh

We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during…

Computation and Language · Computer Science 2025-08-29 Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma

We present a multimodal framework to learn general audio representations from videos. Existing contrastive audio representation learning methods mainly focus on using the audio modality alone during training. In this work, we show that…

Sound · Computer Science 2021-04-29 Luyu Wang, Pauline Luc, Adria Recasens, Jean-Baptiste Alayrac, Aaron van den Oord

Unlike traditional Multimodal Class-Incremental Learning (MCIL) methods that focus only on vision and text, this paper explores MCIL across vision, audio and text modalities, addressing challenges in integrating complementary information…

Machine Learning · Computer Science 2025-06-13 Yukun Chen, Zihuan Qiu, Fanman Meng, Hongliang Li, Linfeng Xu, Qingbo Wu

Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled data in multimodality is rather cost-demanding, especially for audio-visual speech recognition (AVSR). Thus it makes a lot of sense to…

Sound · Computer Science 2022-03-29 Xichen Pan, Peiyu Chen, Yichen Gong, Helong Zhou, Xinbing Wang, Zhouhan Lin