Related papers: Encoder-decoder multimodal speaker change detectio…

200 papers

Speaker change detection (SCD) is an important feature that improves the readability of the recognized words from an automatic speech recognition (ASR) system by breaking the word sequence into paragraphs at speaker change points. Existing…

Audio and Speech Processing · Electrical Eng. & Systems 2023-02-20 Jian Wu, Zhuo Chen, Min Hu, Xiong Xiao, Jinyu Li

We introduce a multilingual speaker change detection model (USM-SCD) that can simultaneously detect speaker turns and perform ASR for 96 languages. This model is adapted from a speech foundation model trained on a large quantity of…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-09 Guanlong Zhao, Yongqiang Wang, Jason Pelecanos, Yu Zhang, Hank Liao, Yiling Huang, Han Lu, Quan Wang

Spoken language change detection (LCD) refers to identifying the language transitions in a code-switched utterance. Similarly, identifying the speaker transitions in a multispeaker utterance is known as speaker change detection (SCD). Since…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-19 Jagabandhu Mishra, S. R. Mahadeva Prasanna

In this work we propose a novel token-based training strategy that improves Transformer-Transducer (T-T) based speaker change detection (SCD) performance. The conventional T-T based SCD model loss optimizes all output tokens equally. Due to…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-06 Guanlong Zhao, Quan Wang, Han Lu, Yiling Huang, Ignacio Lopez Moreno

Speaker change detection is an important task in multi-party interactions such as meetings and conversations. In this paper, we address the speaker change detection task from the perspective of sequence transduction. Specifically, we…

Sound · Computer Science 2022-06-28 Zhiyun Fan, Linhao Dong, Meng Cai, Zejun Ma, Bo Xu

Code-switching (CS) occurs when a speaker alternates words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech has to deal with two or more languages at the same time. In…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-19 Xinyuan Zhou, Emre Yılmaz, Yanhua Long, Yijie Li, Haizhou Li

Spoken language change detection (LCD) refers to detecting language switching points in a multilingual speech signal. Speaker change detection (SCD) refers to locating the speaker change points in a multispeaker speech signal. The objective…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-10 Jagabandhu Mishra, S. R. Mahadeva Prasanna

We present a deep-learning approach for the task of Concurrent Speaker Detection (CSD) using a modified transformer model. Our model is designed to handle multi-microphone data but can also work in the single-microphone case. The method can…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-12 Amit Eliav, Sharon Gannot

Concurrent Speaker Detection (CSD), the task of identifying active speakers and their overlaps in an audio signal, is essential for various audio applications, including meeting transcription, speaker diarization, and speech separation.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-16 Amit Eliav, Sharon Gannot

One-shot voice conversion has received significant attention since only one utterance from source speaker and target speaker respectively is required. Moreover, source speaker and target speaker do not need to be seen during training.…

Sound · Computer Science 2021-06-22 Hongqiang Du, Lei Xie

In multi-talker scenarios such as meetings and conversations, speech processing systems are usually required to segment the audio and then transcribe each segmentation. These two stages are addressed separately by speaker change detection…

Sound · Computer Science 2022-11-18 Zhiyun Fan, Zhenlin Liang, Linhao Dong, Yi Liu, Shiyu Zhou, Meng Cai, Jun Zhang, Zejun Ma, Bo Xu

An utterance that contains speech from multiple languages is known as a code-switched sentence. In this work, we propose a novel technique to predict whether given audio is mono-lingual or code-switched. We propose a multi-modal learning…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-05 Krishna D N

Streaming multi-talker speech translation is a task that involves not only generating accurate and fluent translations with low latency but also recognizing when a speaker change occurs and what the speaker's gender is. Speaker change…

Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and memory mechanism to address this problem. The proposed…

Sound · Computer Science 2020-11-02 Yanpei Shi, Mingjie Chen, Qiang Huang, Thomas Hain

Robust voice activity detection (VAD) is a challenging task in low signal-to-noise (SNR) environments. Recent studies show that speech enhancement is helpful to VAD, but the performance improvement is limited. To address this issue, here we…

Audio and Speech Processing · Electrical Eng. & Systems 2021-04-14 Xu Tan, Xiao-Lei Zhang

We present a novel approach to Speaker Diarization (SD) by leveraging text-based methods focused on Sentence-level Speaker Change Detection within dialogues. Unlike audio-based SD systems, which are often challenged by audio quality and…

Computation and Language · Computer Science 2025-06-16 Peilin Wu, Jinho D. Choi

Attention-based encoder-decoder (AED) models have shown impressive performance in ASR. However, most existing AED methods neglect to simultaneously leverage both acoustic and semantic features in decoder, which is crucial for generating…

Computation and Language · Computer Science 2023-05-24 Tian-Hao Zhang, Hai-Bo Qin, Zhi-Hao Lai, Song-Lu Chen, Qi Liu, Feng Chen, Xinyuan Qian, Xu-Cheng Yin

In speech separation, time-domain approaches have successfully replaced the time-frequency domain with latent sequence feature from a learnable encoder. Conventionally, the feature is separated into speaker-specific ones at the final stage…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-01 Ui-Hyeop Shin, Sangyoun Lee, Taehan Kim, Hyung-Min Park

End-to-end speech-to-text translation models are often initialized with pre-trained speech encoder and pre-trained text decoder. This leads to a significant training gap between pre-training and fine-tuning, largely due to the modality…

Computation and Language · Computer Science 2022-07-05 Jinming Zhao, Hao Yang, Ehsan Shareghi, Gholamreza Haffari

Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice…
