Related papers: Device-directed Utterance Detection

Improving Device Directedness Classification of Utterances with Semantic Lexical Features

User interactions with personal assistants like Alexa, Google Home and Siri are typically initiated by a wake term or wakeword. Several personal assistants feature "follow-up" modes that allow users to make additional interactions without…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-06 Kellen Gillespie , Ioannis C. Konstantakopoulos , Xingzhi Guo , Vishal Thanvantri Vasudevan , Abhinav Sethy

Device-Directed Speech Detection for Follow-up Conversations Using Large Language Models

Follow-up conversations with virtual assistants (VAs) enable a user to seamlessly interact with a VA without the need to repeatedly invoke it using a keyword (after the first query). Therefore, accurate Device-directed Speech Detection…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-06 Ognjen Rudovic , Pranay Dighe , Yi Su , Vineet Garg , Sameer Dharur , Xiaochuan Niu , Ahmed H. Abdelaziz , Saurabh Adya , Ahmed Tewfik

Exploring attention mechanism for acoustic-based classification of speech utterances into system-directed and non-system-directed

Voice controlled virtual assistants (VAs) are now available in smartphones, cars, and standalone devices in homes. In most cases, the user needs to first "wake-up" the VA by saying a particular word/phrase every time he or she wants the VA…

Human-Computer Interaction · Computer Science 2019-02-05 Atta Norouzian , Bogdan Mazoure , Dermot Connolly , Daniel Willett

Streaming ResLSTM with Causal Mean Aggregation for Device-Directed Utterance Detection

In this paper, we propose a streaming model to distinguish voice queries intended for a smart-home device from background speech. The proposed model consists of multiple CNN layers with residual connections, followed by a stacked LSTM…

Audio and Speech Processing · Electrical Eng. & Systems 2020-07-21 Xiaosu Tong , Che-Wei Huang , Sri Harish Mallidi , Shaun Joseph , Sonal Pareek , Chander Chandak , Ariya Rastrow , Roland Maas

DNN-Based Semantic Model for Rescoring N-best Speech Recognition List

The word error rate (WER) of an automatic speech recognition (ASR) system increases when a mismatch occurs between the training and the testing conditions due to the noise, etc. In this case, the acoustic information can be less reliable.…

Computation and Language · Computer Science 2020-11-03 Dominique Fohr , Irina Illina

Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances

Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions according to the results obtained for early NIST SRE (Speaker Recognition Evaluation) datasets. From the practical…

Sound · Computer Science 2020-02-17 Aleksei Gusev , Vladimir Volokhov , Tseren Andzhukaev , Sergey Novoselov , Galina Lavrentyeva , Marina Volkova , Alice Gazizullina , Andrey Shulipa , Artem Gorlanov , Anastasia Avdeeva , Artem Ivanov , Alexander Kozlov , Timur Pekhovsky , Yuri Matveev

Selective Attention System (SAS): Device-Addressed Speech Detection for Real-Time On-Device Voice AI

We study device-addressed speech detection under pre-ASR edge deployment constraints, where systems must decide whether to forward audio before transcription under strict latency and compute limits. We show that, in multi-speaker…

Sound · Computer Science 2026-04-10 David Joohun Kim , Daniyal Anjum , Bonny Banerjee , Omar Abbasi

Knowledge Transfer for Efficient On-device False Trigger Mitigation

In this paper, we address the task of determining whether a given utterance is directed towards a voice-enabled smart-assistant device or not. An undirected utterance is termed as a "false trigger" and false trigger mitigation (FTM) is…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-22 Pranay Dighe , Erik Marchi , Srikanth Vishnubhotla , Sachin Kajarekar , Devang Naik

ASR-Aware End-to-end Neural Diarization

We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model. Two categories of features are explored: features derived directly…

Computation and Language · Computer Science 2022-07-13 Aparna Khare , Eunjung Han , Yuguang Yang , Andreas Stolcke

Effective Cross-Utterance Language Modeling for Conversational Speech Recognition

Conversational speech normally is embodied with loose syntactic structures at the utterance level but simultaneously exhibits topical coherence relations across consecutive utterances. Prior work has shown that capturing longer context…

Computation and Language · Computer Science 2022-06-02 Bi-Cheng Yan , Hsin-Wei Wang , Shih-Hsuan Chiu , Hsuan-Sheng Chiu , Berlin Chen

Modality Dropout for Multimodal Device Directed Speech Detection using Verbal and Non-Verbal Features

Device-directed speech detection (DDSD) is the binary classification task of distinguishing between queries directed at a voice assistant versus side conversation or background speech. State-of-the-art DDSD systems use verbal cues, e.g…

Sound · Computer Science 2023-10-25 Gautam Krishna , Sameer Dharur , Oggi Rudovic , Pranay Dighe , Saurabh Adya , Ahmed Hussen Abdelaziz , Ahmed H Tewfik

Implicit Acoustic Echo Cancellation for Keyword Spotting and Device-Directed Speech Detection

In many speech-enabled human-machine interaction scenarios, user speech can overlap with the device playback audio. In these instances, the performance of tasks such as keyword-spotting (KWS) and device-directed speech detection (DDD) can…

Sound · Computer Science 2022-10-05 Samuele Cornell , Thomas Balestri , Thibaud Sénéchal

Discriminate natural versus loudspeaker emitted speech

In this work, we address a novel, but potentially emerging, problem of discriminating the natural human voices and those played back by any kind of audio devices in the context of interactions with in-house voice user interface. The tackled…

Sound · Computer Science 2019-02-19 Thanh-Ha Le , Philippe Gilberton , Ngoc Q. K. Duong

Stuttering Detection Using Speaker Representations and Self-supervised Contextual Embeddings

The adoption of advanced deep learning architectures in stuttering detection (SD) tasks is challenging due to the limited size of the available datasets. To this end, this work introduces the application of speech embeddings extracted from…

Sound · Computer Science 2023-06-02 Shakeel A. Sheikh , Md Sahidullah , Fabrice Hirsch , Slim Ouni

Speaker Selective Beamformer with Keyword Mask Estimation

This paper addresses the problem of automatic speech recognition (ASR) of a target speaker in background speech. The novelty of our approach is that we focus on a wakeup keyword, which is usually used for activating ASR systems like smart…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-08 Yusuke Kida , Dung Tran , Motoi Omachi , Toru Taniguchi , Yuya Fujita

Multimodal Data and Resource Efficient Device-Directed Speech Detection with Large Foundation Models

Interactions with virtual assistants typically start with a trigger phrase followed by a command. In this work, we explore the possibility of making these interactions more natural by eliminating the need for a trigger phrase. Our goal is…

Sound · Computer Science 2023-12-07 Dominik Wagner , Alexander Churchill , Siddharth Sigtia , Panayiotis Georgiou , Matt Mirsamadi , Aarshee Mishra , Erik Marchi

Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistants (VAs) activation due to…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-31 Vineet Garg , Ognjen Rudovic , Pranay Dighe , Ahmed H. Abdelaziz , Erik Marchi , Saurabh Adya , Chandra Dhir , Ahmed Tewfik

Prediction of speech intelligibility with DNN-based performance measures

This paper presents a speech intelligibility model based on automatic speech recognition (ASR), combining phoneme probabilities from deep neural networks (DNN) and a performance measure that estimates the word error rate from these…

Sound · Computer Science 2022-03-18 Angel Mario Castro Martinez , Constantin Spille , Jana Roßbach , Birger Kollmeier , Bernd T. Meyer

A Comprehensive Study of Deep Bidirectional LSTM RNNs for Acoustic Modeling in Speech Recognition

We present a comprehensive study of deep bidirectional long short-term memory (LSTM) recurrent neural network (RNN) based acoustic models for automatic speech recognition (ASR). We study the effect of size and depth and train models of up…

Neural and Evolutionary Computing · Computer Science 2019-08-06 Albert Zeyer , Patrick Doetsch , Paul Voigtlaender , Ralf Schlüter , Hermann Ney

Adaptive Knowledge Distillation for Device-Directed Speech Detection

Device-directed speech detection (DDSD) is a binary classification task that separates the user's queries to a voice assistant (VA) from background speech or side conversations. This is important for achieving naturalistic user experience.…

Sound · Computer Science 2025-08-06 Hyung Gun Chi , Florian Pesce , Wonil Chang , Oggi Rudovic , Arturo Argueta , Stefan Braun , Vineet Garg , Ahmed Hussen Abdelaziz