Related papers: Multi-task Learning for Voice Trigger Detection

Multi-task Learning for Speaker Verification and Voice Trigger Detection

Automatic speech transcription and speaker recognition are usually treated as separate tasks even though they are interdependent. In this study, we investigate training a single network to perform both tasks jointly. We train the network in…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-04-21 · Siddharth Sigtia, Erik Marchi, Sachin Kajarekar, Devang Naik, John Bridle

Improving Voice Trigger Detection with Metric Learning

Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice…

Sound · Computer Science · 2022-09-15 · Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

Progressive Voice Trigger Detection: Accuracy vs Latency

We present an architecture for voice trigger detection for virtual assistants. The main idea in this work is to exploit information in words that immediately follow the trigger phrase. We first demonstrate that by including more audio…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-03-03 · Siddharth Sigtia, John Bridle, Hywel Richards, Pascal Clark, Erik Marchi, Vineet Garg

Training Multi-Task Adversarial Network for Extracting Noise-Robust Speaker Embedding

Achieving robust speaker recognition performance in noisy environments is still a challenging task. Motivated by the promising performance of multi-task training in a variety of image processing tasks, we explore the potential…

Sound · Computer Science · 2019-05-14 · Jianfeng Zhou, Tao Jiang, Lin Li, Qingyang Hong, Zhe Wang, Bingyin Xia

A Multi-tasking Model of Speaker-Keyword Classification for Keeping Human in the Loop of Drone-assisted Inspection

Audio commands are a preferred communication medium to keep inspectors in the loop of civil infrastructure inspection performed by a semi-autonomous drone. To understand job-specific commands from a group of heterogeneous and dynamic…

Sound · Computer Science · 2022-11-02 · Yu Li, Anisha Parsan, Bill Wang, Penghao Dong, Shanshan Yao, Ruwen Qin

Towards multi-task learning of speech and speaker recognition

We study multi-task learning for two orthogonal speech technology tasks: speech and speaker recognition. We use wav2vec2 as a base architecture with two task-specific output heads. We experiment with different architectural decisions to mix…

Sound · Computer Science · 2023-05-29 · Nik Vaessen, David A. van Leeuwen

Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first pass. Our baseline is an acoustic model (AM), with BiLSTM…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-08-07 · Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir

An Integrated Framework for Two-pass Personalized Voice Trigger

In this paper, we present the XMUSPEECH system for Task 1 of 2020 Personalized Voice Trigger Challenge (PVTC2020). Task 1 is a joint wake-up word detection with speaker verification on close talking data. The whole system consists of a…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-07-01 · Dexin Liao, Jing Li, Yiming Zhi, Song Li, Qingyang Hong, Lin Li

TIMIT Speaker Profiling: A Comparison of Multi-task learning and Single-task learning Approaches

This study employs deep learning techniques to explore four speaker profiling tasks on the TIMIT dataset, namely gender classification, accent classification, age estimation, and speaker identification, highlighting the potential and…

Sound · Computer Science · 2024-04-19 · Rong Wang, Kun Sun

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users…

Computation and Language · Computer Science · 2024-03-27 · Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

Robust Speaker Recognition Using Speech Enhancement And Attention Model

In this paper, a novel architecture for speaker recognition is proposed by cascading speech enhancement and speaker processing. Its aim is to improve speaker recognition performance when speech signals are corrupted by noise. Instead of…

Computation and Language · Computer Science · 2020-05-25 · Yanpei Shi, Qiang Huang, Thomas Hain

Multi-Target Backdoor Attacks Against Speaker Recognition

In this work, we propose a multi-target backdoor attack against speaker identification using position-independent clicking sounds as triggers. Unlike previous single-target approaches, our method targets up to 50 speakers simultaneously,…

Sound · Computer Science · 2025-10-10 · Alexandrine Fortier, Sonal Joshi, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal

Multichannel Voice Trigger Detection Based on Transform-average-concatenate

Voice triggering (VT) enables users to activate their devices by just speaking a trigger phrase. A front-end system is typically used to perform speech enhancement and/or separation, and produces multiple enhanced and/or separated signals.…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-02-15 · Takuya Higuchi, Avamarie Brueggeman, Masood Delfarah, Stephen Shum

Multi-Task Learning for Speaker-Role Adaptation in Neural Conversation Models

Building a persona-based conversation agent is challenging owing to the lack of large amounts of speaker-specific conversation data for model training. This paper addresses the problem by proposing a multi-task learning approach to training…

Computation and Language · Computer Science · 2017-10-23 · Yi Luan, Chris Brockett, Bill Dolan, Jianfeng Gao, Michel Galley

Noise-Agnostic Multitask Whisper Training for Reducing False Alarm Errors in Call-for-Help Detection

Keyword spotting is often implemented by attaching a keyword classifier to the encoder in acoustic models, enabling the classification of predefined or open-vocabulary keywords. Although keyword spotting is a crucial task in various applications and…

Sound · Computer Science · 2025-01-22 · Myeonghoon Ryu, June-Woo Kim, Minseok Oh, Suji Lee, Han Park

Cross-Lingual Text-to-Speech Using Multi-Task Learning and Speaker Classifier Joint Training

In cross-lingual speech synthesis, the speech in various languages can be synthesized for a monoglot speaker. Normally, only the data of monoglot speakers are available for model training, thus the speaker similarity is relatively low…

Sound · Computer Science · 2022-01-21 · J. Yang, Lei He

Empowering Whisper as a Joint Multi-Talker and Target-Talker Speech Recognition System

Multi-talker speech recognition and target-talker speech recognition, both of which involve transcription in multi-talker contexts, remain significant challenges. However, existing methods rarely attempt to simultaneously address both tasks. In this…

Sound · Computer Science · 2024-08-27 · Lingwei Meng, Jiawen Kang, Yuejiao Wang, Zengrui Jin, Xixin Wu, Xunying Liu, Helen Meng

Multi-task Recurrent Model for True Multilingual Speech Recognition

Research on multilingual speech recognition remains attractive yet challenging. Recent studies focus on learning shared structures under the multi-task paradigm, in particular a feature sharing structure. This approach has been found…

Computation and Language · Computer Science · 2016-09-28 · Zhiyuan Tang, Lantian Li, Dong Wang

Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

A main challenge in applying deep learning to music processing is the availability of training data. One potential solution is Multi-task Learning, in which the model also learns to solve related auxiliary tasks on additional datasets to…

Sound · Computer Science · 2018-04-06 · Daniel Stoller, Sebastian Ewert, Simon Dixon

Look&Listen: Multi-Modal Correlation Learning for Active Speaker Detection and Speech Enhancement

Active speaker detection and speech enhancement have become two increasingly attractive topics in audio-visual scenario understanding. According to their respective characteristics, the scheme of independently designed architecture has been…

Sound · Computer Science · 2022-07-08 · Junwen Xiong, Yu Zhou, Peng Zhang, Lei Xie, Wei Huang, Yufei Zha