Related papers: Self-supervised learning for audio-visual speaker …

A Review of Speaker Diarization: Recent Advances with Deep Learning

Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-11-29 · Tae Jin Park, Naoyuki Kanda, Dimitrios Dimitriadis, Kyu J. Han, Shinji Watanabe, Shrikanth Narayanan

Audio-Visual Speaker Diarization Based on Spatiotemporal Bayesian Fusion

Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants…

Computer Vision and Pattern Recognition · Computer Science · 2018-10-15 · Israel D. Gebru, Silèye Ba, Xiaofei Li, Radu Horaud

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization

Speaker diarization, the process of segmenting an audio stream or transcribed speech content into homogenous partitions based on speaker identity, plays a crucial role in the interpretation and analysis of human speech. Most existing…

Machine Learning · Computer Science · 2024-08-23 · Luyao Cheng, Hui Wang, Siqi Zheng, Yafeng Chen, Rongjie Huang, Qinglin Zhang, Qian Chen, Xihao Li

Joint Speech Recognition and Speaker Diarization via Sequence Transduction

Speech applications dealing with conversations require not only recognizing the spoken words, but also determining who spoke when. The task of assigning words to speakers is typically addressed by merging the outputs of two separate…

Computation and Language · Computer Science · 2019-07-12 · Laurent El Shafey, Hagen Soltau, Izhak Shafran

Once more Diarization: Improving meeting transcription systems through segment-level speaker reassignment

Diarization is a crucial component in meeting transcription systems to ease the challenges of speech enhancement and attribute the transcriptions to the correct speaker. Particularly in the presence of overlapping or noisy speech, these…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-06-06 · Christoph Boeddeker, Tobias Cord-Landwehr, Reinhold Haeb-Umbach

Unsupervised Speaker Diarization in Distributed IoT Networks Using Federated Learning

This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices. The work proposes a Federated Learning model which can identify the participants in a conversation without…

Sound · Computer Science · 2024-12-02 · Amit Kumar Bhuyan, Hrishikesh Dutta, Subir Biswas

Chronological Self-Training for Real-Time Speaker Diarization

Diarization partitions an audio stream into segments based on the voices of the speakers. Real-time diarization systems that include an enrollment step should limit enrollment training samples to reduce user interaction time. Although…

Sound · Computer Science · 2022-08-09 · Dirk Padfield, Daniel J. Liebling

Robust Target Speaker Diarization and Separation via Augmented Speaker Embedding Sampling

Traditional speech separation and speaker diarization approaches rely on prior knowledge of target speakers or a predetermined number of participants in audio signals. To address these limitations, recent advances focus on developing…

Sound · Computer Science · 2025-08-11 · Md Asif Jalal, Luca Remaggi, Vasileios Moschopoulos, Thanasis Kotsiopoulos, Vandana Rajan, Karthikeyan Saravanan, Anastasis Drosou, Junho Heo, Hyuk Oh, Seokyeong Jeong

Pretraining Multi-Speaker Identification for Neural Speaker Diarization

End-to-end speaker diarization enables accurate overlap-aware diarization by jointly estimating multiple speakers' speech activities in parallel. This approach is data-hungry, requiring a large amount of labeled conversational data, which…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-06-02 · Shota Horiguchi, Atsushi Ando, Marc Delcroix, Naohiro Tawara

On the Use of Self-Supervised Representation Learning for Speaker Diarization and Separation

Self-supervised speech models such as wav2vec2.0 and WavLM have been shown to significantly improve the performance of many downstream speech tasks, especially in low-resource settings, over the past few years. Despite this, evaluations on…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-12-18 · Séverin Baroudi, Hervé Bredin, Joseph Razik, Ricard Marxer

Aligning Speakers: Evaluating and Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment (Extended Version)

This paper presents a novel evaluation approach to text-based speaker diarization (SD), tackling the limitations of traditional metrics that do not account for any contextual information in text. Two new metrics are proposed, Text-based…

Computation and Language · Computer Science · 2023-09-15 · Chen Gong, Peilin Wu, Jinho D. Choi

Speaker Diarization: Using Recurrent Neural Networks

Speaker Diarization is the problem of separating speakers in an audio. There could be any number of speakers and final result should state when speaker starts and ends. In this project, we analyze given audio file with 2 channels and 2…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-06-11 · Vishal Sharma, Zekun Zhang, Zachary Neubert, Curtis Dyreson

Self-supervised Speaker Diarization

Over the last few years, deep learning has grown in popularity for speaker verification, identification, and diarization. Inarguably, a significant part of this success is due to the demonstrated effectiveness of their speaker…

Sound · Computer Science · 2022-10-07 · Yehoshua Dissen, Felix Kreuk, Joseph Keshet

Triplet Network with Attention for Speaker Diarization

In automatic speech processing systems, speaker diarization is a crucial front-end component to separate segments from different speakers. Inspired by the recent success of deep neural networks (DNNs) in semantic inferencing, triplet…

Audio and Speech Processing · Electrical Eng. & Systems · 2018-08-07 · Huan Song, Megan Willi, Jayaraman J. Thiagarajan, Visar Berisha, Andreas Spanias

Speaker Diarization of Scripted Audiovisual Content

The media localization industry usually requires a verbatim script of the final film or TV production in order to create subtitles or dubbing scripts in a foreign language. In particular, the verbatim script (i.e. as-broadcast script) must…

Computation and Language · Computer Science · 2023-08-07 · Yogesh Virkar, Brian Thompson, Rohit Paturi, Sundararajan Srinivasan, Marcello Federico

A Real-time Speaker Diarization System Based on Spatial Spectrum

In this paper we describe a speaker diarization system that enables localization and identification of all speakers present in a conversation or meeting. We propose a novel systematic approach to tackle several long-standing challenges in…

Sound · Computer Science · 2021-07-21 · Siqi Zheng, Weilong Huang, Xianliang Wang, Hongbin Suo, Jinwei Feng, Zhijie Yan

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

Multi-speaker automatic speech recognition (ASR) aims to transcribe conversational speech involving multiple speakers, requiring the model to capture not only what was said, but also who said it and sometimes when it was spoken. Recent…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-04-27 · Li Li, Ming Cheng, Weixin Zhu, Yannan Wang, Juan Liu, Ming Li

A Reinforcement Learning Framework for Online Speaker Diarization

Speaker diarization is a task to label an audio or video recording with the identity of the speaker at each given time stamp. In this work, we propose a novel machine learning framework to conduct real-time multi-speaker diarization and…

Sound · Computer Science · 2023-02-23 · Baihan Lin, Xinxin Zhang

Efficient Personalized Speech Enhancement through Self-Supervised Learning

This work presents self-supervised learning methods for developing monaural speaker-specific (i.e., personalized) speech enhancement models. While generalist models must broadly address many speakers, specialist models can adapt their…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-07-28 · Aswin Sivaraman, Minje Kim

Self-Supervised Learning of Audio-Visual Objects from Video

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate…

Computer Vision and Pattern Recognition · Computer Science · 2020-08-11 · Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman