Related papers: Encoder-decoder multimodal speaker change detectio…

Speaker Change Detection for Transformer Transducer ASR

Speaker change detection (SCD) is an important feature that improves the readability of the recognized words from an automatic speech recognition (ASR) system by breaking the word sequence into paragraphs at speaker change points. Existing…

Audio and Speech Processing · Electrical Eng. & Systems 2023-02-20 Jian Wu, Zhuo Chen, Min Hu, Xiong Xiao, Jinyu Li

USM-SCD: Multilingual Speaker Change Detection Based on Large Pretrained Foundation Models

We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-09 Guanlong Zhao, Yongqiang Wang, Jason Pelecanos, Yu Zhang, Hank Liao, Yiling Huang, Han Lu, Quan Wang

Spoken language change detection inspired by speaker change detection

Spoken language change detection (LCD) refers to identifying the language transitions in a code-switched utterance. Similarly, identifying the speaker transitions in a multispeaker utterance is known as speaker change detection (SCD). Since…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-19 Jagabandhu Mishra, S. R. Mahadeva Prasanna

Augmenting Transformer-Transducer Based Speaker Change Detection With Token-Level Training Loss

In this work we propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance. The conventional T-T based SCD model loss optimizes all output tokens equally. Due to…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-06 Guanlong Zhao, Quan Wang, Han Lu, Yiling Huang, Ignacio Lopez Moreno

Sequence-level Speaker Change Detection with Difference-based Continuous Integrate-and-fire

Speaker change detection is an important task in multi-party interactions such as meetings and conversations. In this paper, we address the speaker change detection task from the perspective of sequence transduction. Specifically, we…

Sound · Computer Science 2022-06-28 Zhiyun Fan, Linhao Dong, Meng Cai, Zejun Ma, Bo Xu

Multi-Encoder-Decoder Transformer for Code-Switching Speech Recognition

Code-switching (CS) occurs when a speaker alternates words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech has to deal with two or more languages at the same time. In…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-19 Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, Haizhou Li

Language vs Speaker Change: A Comparative Study

Spoken language change detection (LCD) refers to detecting language switching points in a multilingual speech signal. Speaker change detection (SCD) refers to locating the speaker change points in a multispeaker speech signal. The objective…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-10 Jagabandhu Mishra, S. R. Mahadeva Prasanna

Concurrent Speaker Detection: A multi-microphone Transformer-Based Approach

We present a deep-learning approach for the task of Concurrent Speaker Detection (CSD) using a modified transformer model. Our model is designed to handle multi-microphone data but can also work in the single-microphone case. The method can…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-12 Amit Eliav, Sharon Gannot

Audio-Visual Approach For Multimodal Concurrent Speaker Detection

Concurrent Speaker Detection (CSD), the task of identifying active speakers and their overlaps in an audio signal, is essential for various audio applications, including meeting transcription, speaker diarization, and speech separation.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-16 Amit Eliav, Sharon Gannot

Improving robustness of one-shot voice conversion with deep discriminative speaker encoder

One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training.…

Sound · Computer Science 2021-06-22 Hongqiang Du, Lei Xie

Token-level Speaker Change Detection Using Speaker Difference and Speech Content via Continuous Integrate-and-fire

In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to segment the audio and then transcribe each segmentation. These two stages are addressed separately by speaker change detection…

Sound · Computer Science 2022-11-18 Zhiyun Fan, Zhenlin Liang, Linhao Dong, Yi Liu, Shiyu Zhou, Meng Cai, Jun Zhang, Zejun Ma, Bo Xu

Multi-Modal Transformers Utterance-Level Code-Switching Detection

An utterance that contains speech from multiple languages is known as a code-switched sentence. In this work, we propose a novel technique to predict whether given audio is mono-lingual or code-switched. We propose a multi-modal learning…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-05 Krishna D N

Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation

Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change…

Sound · Computer Science 2025-02-06 Peidong Wang, Naoyuki Kanda, Jian Xue, Jinyu Li, Xiaofei Wang, Aswin Shanmugam Subramanian, Junkun Chen, Sunit Sivasankaran, Xiong Xiao, Yong Zhao

T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model

Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and memory mechanism to address this problem. The proposed…

Sound · Computer Science 2020-11-02 Yanpei Shi, Mingjie Chen, Qiang Huang, Thomas Hain

Speech enhancement aided end-to-end multi-task learning for voice activity detection

Robust voice activity detection (VAD) is a challenging task in low signal-to-noise (SNR) environments. Recent studies show that speech enhancement is helpful to VAD, but the performance improvement is limited. To address this issue, here we…

Audio and Speech Processing · Electrical Eng. & Systems 2021-04-14 Xu Tan, Xiao-Lei Zhang

Do We Still Need Audio? Rethinking Speaker Diarization with a Text-Based Approach Using Multiple Prediction Models

We present a novel approach to Speaker Diarization (SD) by leveraging text-based methods focused on Sentence-level Speaker Change Detection within dialogues. Unlike audio-based SD systems, which are often challenged by audio quality and…

Computation and Language · Computer Science 2025-06-16 Peilin Wu, Jinho D. Choi

Rethinking Speech Recognition with A Multimodal Perspective via Acoustic and Semantic Cooperative Decoding

Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in decoder, which is crucial for generating…

Computation and Language · Computer Science 2023-05-24 Tian-Hao Zhang, Hai-Bo Qin, Zhi-Hao Lai, Song-Lu Chen, Qi Liu, Feng Chen, Xinyuan Qian, Xu-Cheng Yin

Separate and Reconstruct: Asymmetric Encoder-Decoder for Speech Separation

In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence feature from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones at the final stage…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-01 Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, Hyung-Min Park

M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation

End-to-end speech-to-text translation models are often initialized with pre-trained speech encoder and pre-trained text decoder. This leads to a significant training gap between pre-training and fine-tuning, largely due to the modality…

Computation and Language · Computer Science 2022-07-05 Jinming Zhao, Hao Yang, Ehsan Shareghi, Gholamreza Haffari

Improving Voice Trigger Detection with Metric Learning

Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice…

Sound · Computer Science 2022-09-15 Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik