Related papers: Exploring Efficient-Tuned Learning Audio Represent…

UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models

Multimodal large models have been recognized for their advantages in various performance and downstream tasks. The development of these models is crucial towards achieving general artificial intelligence in the future. In this paper, we…

Sound · Computer Science · 2023-09-12 · Sen Fang, Bowen Gao, Yangjian Wu, Teik Toe Teoh

Wav2CLIP: Learning Robust Audio Representations From CLIP

We propose Wav2CLIP, a robust audio representation learning method by distilling from Contrastive Language-Image Pre-training (CLIP). We systematically evaluate Wav2CLIP on a variety of audio tasks including classification, retrieval, and…

Sound · Computer Science · 2022-02-16 · Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, Juan Pablo Bello

Veagle: Advancements in Multimodal Representation Learning

Lately, researchers in artificial intelligence have been really interested in how language and vision come together, giving rise to the development of multimodal models that aim to seamlessly integrate textual and visual information.…

Computer Vision and Pattern Recognition · Computer Science · 2024-10-29 · Rajat Chawla, Arkajit Datta, Tushar Verma, Adarsh Jha, Anmol Gautam, Ayush Vatsal, Sukrit Chaterjee, Mukunda NS, Ishaan Bhola

Decoding Visual Neural Representations by Multimodal Learning of Brain-Visual-Linguistic Features

Decoding human visual neural representations is a challenging task with great scientific significance in revealing vision-processing mechanisms and developing brain-like intelligent machines. Most existing methods are difficult to…

Computer Vision and Pattern Recognition · Computer Science · 2023-03-31 · Changde Du, Kaicheng Fu, Jinpeng Li, Huiguang He

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

With the advance in self-supervised learning for audio and visual modalities, it has become possible to learn a robust audio-visual speech representation. This would be beneficial for improving the audio-visual speech recognition (AVSR)…

Image and Video Processing · Electrical Eng. & Systems · 2022-07-12 · Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai

Audio-Visual Model Distillation Using Acoustic Images

In this paper, we investigate how to learn rich and robust feature representations for audio classification from visual data and acoustic images, a novel audio data modality. Former models learn audio representations from raw signals or…

Computer Vision and Pattern Recognition · Computer Science · 2020-02-12 · Andrés F. Pérez, Valentina Sanguineti, Pietro Morerio, Vittorio Murino

Versatile audio-visual learning for emotion recognition

Most current audio-visual emotion recognition models lack the flexibility needed for deployment in practical applications. We envision a multimodal system that works even when only one modality is available and can be implemented…

Machine Learning · Computer Science · 2026-01-13 · Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos Busso

Multichannel AV-wav2vec2: A Framework for Learning Multichannel Multi-Modal Speech Representation

Self-supervised speech pre-training methods have developed rapidly in recent years and have been shown to be very effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing is suffering from the…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-01-09 · Qiushi Zhu, Jie Zhang, Yu Gu, Yuchen Hu, Lirong Dai

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified &…

Computer Vision and Pattern Recognition · Computer Science · 2026-02-24 · Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang

MAViL: Masked Audio-Video Learners

We present Masked Audio-Video Learners (MAViL) to train audio-visual representations. Our approach learns with three complementary forms of self-supervision: (1) reconstruction of masked audio and video input data, (2) intra- and…

Computer Vision and Pattern Recognition · Computer Science · 2023-07-18 · Po-Yao Huang, Vasu Sharma, Hu Xu, Chaitanya Ryali, Haoqi Fan, Yanghao Li, Shang-Wen Li, Gargi Ghosh, Jitendra Malik, Christoph Feichtenhofer

A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports

Joint image-text embedding extracted from medical images and associated contextual reports is the bedrock for most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval,…

Computer Vision and Pattern Recognition · Computer Science · 2020-09-04 · Yikuan Li, Hanyin Wang, Yuan Luo

Multimodal Input Aids a Bayesian Model of Phonetic Learning

One of the many tasks facing the typically-developing child language learner is learning to discriminate between the distinctive sounds that make up words in their native language. Here we investigate whether multimodal…

Computation and Language · Computer Science · 2024-07-24 · Sophia Zhi, Roger P. Levy, Stephan C. Meylan

Learning Cross-lingual Visual Speech Representations

Cross-lingual self-supervised learning has been a growing research topic in the last few years. However, current works only explored the use of audio signals to create representations. In this work, we study cross-lingual self-supervised…

Computation and Language · Computer Science · 2023-03-17 · Andreas Zinonos, Alexandros Haliassos, Pingchuan Ma, Stavros Petridis, Maja Pantic

Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training

Recently a number of studies demonstrated impressive performance on diverse vision-language multi-modal tasks such as image captioning and visual question answering by extending the BERT architecture with multi-modal pre-training…

Computer Vision and Pattern Recognition · Computer Science · 2022-09-22 · Jong Hak Moon, Hyungyung Lee, Woncheol Shin, Young-Hak Kim, Edward Choi

Do Audio-Visual Large Language Models Really See and Hear?

Audio-Visual Large Language Models (AVLLMs) are emerging as unified interfaces to multimodal perception. We present the first mechanistic interpretability study of AVLLMs, analyzing how audio and visual features evolve and fuse through…

Artificial Intelligence · Computer Science · 2026-04-06 · Ramaneswaran Selvakumar, Kaousheik Jayakumar, S Sakshi, Sreyan Ghosh, Ruohan Gao, Dinesh Manocha

Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap…

Computer Vision and Pattern Recognition · Computer Science · 2024-12-10 · Kim Sung-Bin, Arda Senocak, Hyunwoo Ha, Tae-Hyun Oh

Seeing is Believing: Emotion-Aware Audio-Visual Language Modeling for Expressive Speech Generation

We present an Audio-Visual Language Model (AVLM) for expressive speech generation by integrating full-face visual cues into a pre-trained expressive speech model. We explore multiple visual encoders and multimodal fusion strategies during…

Computation and Language · Computer Science · 2025-08-29 · Weiting Tan, Jiachen Lian, Hirofumi Inaguma, Paden Tomasello, Philipp Koehn, Xutai Ma

Multimodal Self-Supervised Learning of General Audio Representations

We present a multimodal framework to learn general audio representations from videos. Existing contrastive audio representation learning methods mainly focus on using the audio modality alone during training. In this work, we show that…

Sound · Computer Science · 2021-04-29 · Luyu Wang, Pauline Luc, Adria Recasens, Jean-Baptiste Alayrac, Aaron van den Oord

Leveraging Pre-Trained Models for Multimodal Class-Incremental Learning under Adaptive Fusion

Unlike traditional Multimodal Class-Incremental Learning (MCIL) methods that focus only on vision and text, this paper explores MCIL across vision, audio and text modalities, addressing challenges in integrating complementary information…

Machine Learning · Computer Science · 2025-06-13 · Yukun Chen, Zihuan Qiu, Fanman Meng, Hongliang Li, Linfeng Xu, Qingbo Wu

Leveraging Unimodal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition

Training Transformer-based models demands a large amount of data, while obtaining aligned and labelled data in multimodality is rather cost-demanding, especially for audio-visual speech recognition (AVSR). Thus it makes a lot of sense to…

Sound · Computer Science · 2022-03-29 · Xichen Pan, Peiyu Chen, Yichen Gong, Helong Zhou, Xinbing Wang, Zhouhan Lin