Related papers: Multi-Modulation Network for Audio-Visual Event Lo…

MPN: Multimodal Parallel Network for Audio-Visual Event Localization

Audio-visual event localization aims to localize an event that is both audible and visible in the wild, which is a widespread audio-visual scene analysis task for unconstrained videos. To address this task, we propose a Multimodal Parallel…

Computer Vision and Pattern Recognition · Computer Science 2021-04-08 Jiashuo Yu, Ying Cheng, Rui Feng

Audio-Visual Event Localization in Unconstrained Videos

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE)…

Computer Vision and Pattern Recognition · Computer Science 2018-03-26 Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu

Dual-modality seq2seq network for audio-visual event localization

Audio-visual event localization requires one to identify the event which is both visible and audible in a video (either at a frame or video level). To address this task, we propose a deep neural network named Audio-Visual…

Computer Vision and Pattern Recognition · Computer Science 2020-08-07 Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang

Multi-level Attention Fusion Network for Audio-visual Event Recognition

Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion…

Computer Vision and Pattern Recognition · Computer Science 2021-06-15 Mathilde Brousmiche, Jean Rouat, Stéphane Dupont

Cross-Modality and Within-Modality Regularization for Audio-Visual DeepFake Detection

Audio-visual deepfake detection scrutinizes manipulations in public video using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets, face challenges due to uncertainties and…

Multimedia · Computer Science 2024-01-12 Heqing Zou, Meng Shen, Yuchen Hu, Chen Chen, Eng Siong Chng, Deepu Rajan

Multi-scale Multi-instance Visual Sound Localization and Segmentation

Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Shentong Mo, Haofan Wang

CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization

The audio-visual event localization task requires identifying concurrent visual and auditory events from unconstrained videos within a network model, locating them, and classifying their category. The efficient extraction and integration of…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Xiang He, Xiangxi Liu, Yang Li, Dongcheng Zhao, Guobin Shen, Qingqun Kong, Xin Yang, Yi Zeng

MFF-EINV2: Multi-scale Feature Fusion across Spectral-Spatial-Temporal Domains for Sound Event Localization and Detection

Sound Event Localization and Detection (SELD) involves detecting and localizing sound events using multichannel sound recordings. Previously proposed Event-Independent Network V2 (EINV2) has achieved outstanding performance on SELD.…

Sound · Computer Science 2024-06-18 Da Mu, Zhicheng Zhang, Haobo Yue

Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention

The major challenge in the audio-visual event localization task lies in how to fuse information from multiple modalities effectively. Recent works have shown that attention mechanisms are beneficial to the fusion process. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2020-08-18 Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, Yan Yan

STNet: Deep Audio-Visual Fusion Network for Robust Speaker Tracking

Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several…

Computer Vision and Pattern Recognition · Computer Science 2024-10-10 Yidi Li, Hong Liu, Bing Yang

Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing

Existing works on weakly-supervised audio-visual video parsing adopt hybrid attention network (HAN) as the multi-modal embedding to capture the cross-modal context. It embeds the audio and visual modalities with a shared network, where the…

Computer Vision and Pattern Recognition · Computer Science 2023-11-15 Yating Xu, Conghui Hu, Gim Hee Lee

Pilot-guided Multimodal Semantic Communication for Audio-Visual Event Localization

Multimodal semantic communication, which integrates various data modalities such as text, images, and audio, significantly enhances communication efficiency and reliability. It has broad application prospects in fields such as artificial…

Sound · Computer Science 2024-12-10 Fei Yu, Zhe Xiang, Nan Che, Zhuoran Zhang, Yuandi Li, Junxiao Xue, Zhiguo Wan

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in auditory and visual modalities, multimodal detailed perception is essential for complete scene comprehension. Most previous…

Computer Vision and Pattern Recognition · Computer Science 2022-07-13 Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

Language Guided Networks for Cross-modal Moment Retrieval

We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query. It poses great challenges over the proper semantic alignment between…

Computer Vision and Pattern Recognition · Computer Science 2022-08-22 Kun Liu, Huadong Ma, Chuang Gan

On Attention Modules for Audio-Visual Synchronization

With the development of media and networking technologies, multimedia applications ranging from feature presentation in a cinema setting to video on demand to interactive video conferencing are in great demand. Good synchronization between…

Computer Vision and Pattern Recognition · Computer Science 2018-12-17 Naji Khosravan, Shervin Ardeshir, Rohit Puri

Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration

In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Ziheng Zhou, Jinxing Zhou, Wei Qian, Shengeng Tang, Xiaojun Chang, Dan Guo

Fusion of Audio and Visual Embeddings for Sound Event Localization and Detection

Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included.…

Audio and Speech Processing · Electrical Eng. & Systems 2023-12-15 Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, Philip J. B. Jackson

Investigating Modality Bias in Audio Visual Video Parsing

We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries. The task is especially challenging since it is weakly supervised with only event labels available as a…

Computer Vision and Pattern Recognition · Computer Science 2022-11-14 Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan

Self-Supervised Model Adaptation for Multimodal Semantic Segmentation

Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by…

Computer Vision and Pattern Recognition · Computer Science 2021-11-05 Abhinav Valada, Rohit Mohan, Wolfram Burgard