Related papers: Multi-Modulation Network for Audio-Visual Event Lo…

200 papers

Audio-visual event localization aims to localize an event that is both audible and visible in the wild, which is a widespread audio-visual scene analysis task for unconstrained videos. To address this task, we propose a Multimodal Parallel…

Computer Vision and Pattern Recognition · Computer Science 2021-04-08 Jiashuo Yu, Ying Cheng, Rui Feng

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE)…

Computer Vision and Pattern Recognition · Computer Science 2018-03-26 Yapeng Tian, Jing Shi, Bochen Li, Zhiyao Duan, Chenliang Xu

Audio-visual event localization requires one to identify the event which is both visible and audible in a video (either at a frame or video level). To address this task, we propose a deep neural network named Audio-Visual…

Computer Vision and Pattern Recognition · Computer Science 2020-08-07 Yan-Bo Lin, Yu-Jhe Li, Yu-Chiang Frank Wang

Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion…

Computer Vision and Pattern Recognition · Computer Science 2021-06-15 Mathilde Brousmiche, Jean Rouat, Stéphane Dupont

Audio-visual deepfake detection scrutinizes manipulations in public video using complementary multimodal cues. Current methods, which train on fused multimodal data for multimodal targets, face challenges due to uncertainties and…

Multimedia · Computer Science 2024-01-12 Heqing Zou, Meng Shen, Yuchen Hu, Chen Chen, Eng Siong Chng, Deepu Rajan

Visual sound localization is a typical and challenging problem that predicts the location of objects corresponding to the sound source in a video. Previous methods mainly used the audio-visual association between global audio and one-scale…

Computer Vision and Pattern Recognition · Computer Science 2024-09-04 Shentong Mo, Haofan Wang

The audio-visual event localization task requires identifying concurrent visual and auditory events from unconstrained videos within a network model, locating them, and classifying their category. The efficient extraction and integration of…

Computer Vision and Pattern Recognition · Computer Science 2024-08-06 Xiang He, Xiangxi Liu, Yang Li, Dongcheng Zhao, Guobin Shen, Qingqun Kong, Xin Yang, Yi Zeng

Sound Event Localization and Detection (SELD) involves detecting and localizing sound events using multichannel sound recordings. The previously proposed Event-Independent Network V2 (EINV2) has achieved outstanding performance on SELD.…

Sound · Computer Science 2024-06-18 Da Mu, Zhicheng Zhang, Haobo Yue

The major challenge in the audio-visual event localization task lies in how to fuse information from multiple modalities effectively. Recent works have shown that attention mechanisms are beneficial to the fusion process. In this paper, we…

Computer Vision and Pattern Recognition · Computer Science 2020-08-18 Bin Duan, Hao Tang, Wei Wang, Ziliang Zong, Guowei Yang, Yan Yan

Audio-visual speaker tracking aims to determine the location of human targets in a scene using signals captured by a multi-sensor platform, whose accuracy and robustness can be improved by multi-modal fusion methods. Recently, several…

Computer Vision and Pattern Recognition · Computer Science 2024-10-10 Yidi Li, Hong Liu, Bing Yang

Existing works on weakly-supervised audio-visual video parsing adopt the hybrid attention network (HAN) as the multi-modal embedding to capture the cross-modal context. It embeds the audio and visual modalities with a shared network, where the…

Computer Vision and Pattern Recognition · Computer Science 2023-11-15 Yating Xu, Conghui Hu, Gim Hee Lee

Multimodal semantic communication, which integrates various data modalities such as text, images, and audio, significantly enhances communication efficiency and reliability. It has broad application prospects in fields such as artificial…

Sound · Computer Science 2024-12-10 Fei Yu, Zhe Xiang, Nan Che, Zhuoran Zhang, Yuandi Li, Junxiao Xue, Zhiguo Wan

Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in auditory and visual modalities, multimodal detailed perception is essential for complete scene comprehension. Most previous…

Computer Vision and Pattern Recognition · Computer Science 2022-07-13 Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang

Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality…

Computer Vision and Pattern Recognition · Computer Science 2023-05-12 Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

We address the challenging task of cross-modal moment retrieval, which aims to localize a temporal segment from an untrimmed video described by a natural language query. It poses great challenges over the proper semantic alignment between…

Computer Vision and Pattern Recognition · Computer Science 2022-08-22 Kun Liu, Huadong Ma, Chuang Gan

With the development of media and networking technologies, multimedia applications ranging from feature presentation in a cinema setting to video on demand to interactive video conferencing are in great demand. Good synchronization between…

Computer Vision and Pattern Recognition · Computer Science 2018-12-17 Naji Khosravan, Shervin Ardeshir, Rohit Puri

In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for…

Computer Vision and Pattern Recognition · Computer Science 2024-12-19 Ziheng Zhou, Jinxing Zhou, Wei Qian, Shengeng Tang, Xiaojun Chang, Dan Guo

Sound event localization and detection (SELD) combines two subtasks: sound event detection (SED) and direction of arrival (DOA) estimation. SELD is usually tackled as an audio-only problem, but visual information has been recently included.…

Audio and Speech Processing · Electrical Eng. & Systems 2023-12-15 Davide Berghi, Peipei Wu, Jinzheng Zhao, Wenwu Wang, Philip J. B. Jackson

We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries. The task is especially challenging since it is weakly supervised with only event labels available as a…

Computer Vision and Pattern Recognition · Computer Science 2022-11-14 Piyush Singh Pasi, Shubham Nemani, Preethi Jyothi, Ganesh Ramakrishnan

Learning to reliably perceive and understand the scene is an integral enabler for robots to operate in the real world. This problem is inherently challenging due to the multitude of object types as well as appearance changes caused by…

Computer Vision and Pattern Recognition · Computer Science 2021-11-05 Abhinav Valada, Rohit Mohan, Wolfram Burgard