Related papers: Knowledge Transfer for Efficient On-device False T…

Lattice-based Improvements for Voice Triggering Using Graph Neural Networks

Voice-triggered smart assistants often rely on detection of a trigger-phrase before they start listening for the user request. Mitigation of false triggers is an important aspect of building a privacy-centric non-intrusive smart assistant.…

Audio and Speech Processing · Electrical Eng. & Systems 2020-01-30 Pranay Dighe, Saurabh Adya, Nuoyu Li, Srikanth Vishnubhotla, Devang Naik, Adithya Sagar, Ying Ma, Stephen Pulman, Jason Williams

Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation

We present a unified and hardware efficient architecture for two stage voice trigger detection (VTD) and false trigger mitigation (FTM) tasks. Two stage VTD systems of voice assistants can get falsely activated to audio segments…

Audio and Speech Processing · Electrical Eng. & Systems 2021-05-17 Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir

Device-Directed Speech Detection: Regularization via Distillation for Weakly-Supervised Models

We address the problem of detecting speech directed to a device that does not contain a specific wake-word. Specifically, we focus on audio coming from a touch-based invocation. Mitigating virtual assistants (VAs) activation due to…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-31 Vineet Garg, Ognjen Rudovic, Pranay Dighe, Ahmed H. Abdelaziz, Erik Marchi, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

Streaming on-device detection of device directed speech from voice and touch-based invocation

When interacting with smart devices such as mobile phones or wearables, the user typically invokes a virtual assistant (VA) by saying a keyword or by pressing a button on the device. However, in many cases, the VA can accidentally be…

Sound · Computer Science 2021-10-12 Ognjen Rudovic, Akanksha Bindal, Vineet Garg, Pramod Simha, Pranay Dighe, Sachin Kajarekar

Complementary Language Model and Parallel Bi-LRNN for False Trigger Mitigation

False triggers in voice assistants are unintended invocations of the assistant, which not only degrade the user experience but may also compromise privacy. False trigger mitigation (FTM) is a process to detect the false trigger events and…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-20 Rishika Agarwal, Xiaochuan Niu, Pranay Dighe, Srikanth Vishnubhotla, Sameer Badaskar, Devang Naik

Device-directed Utterance Detection

In this work, we propose a classifier for distinguishing device-directed queries from background speech in the context of interactions with voice assistants. Applications include rejection of false wake-ups or unintended interactions as…

Computation and Language · Computer Science 2018-08-09 Sri Harish Mallidi, Roland Maas, Kyle Goehner, Ariya Rastrow, Spyros Matsoukas, Björn Hoffmeister

Towards Better Understanding of Spontaneous Conversations: Overcoming Automatic Speech Recognition Errors With Intent Recognition

In this paper, we present a method for correcting automatic speech recognition (ASR) errors using a finite state transducer (FST) intent recognition framework. Intent recognition is a powerful technique for dialog flow management in…

Computation and Language · Computer Science 2019-08-22 Piotr Żelasko, Jan Mizgajski, Mikołaj Morzy, Adrian Szymczak, Piotr Szymański, Łukasz Augustyniak, Yishay Carmiel

A Multimodal Approach to Device-Directed Speech Detection with Large Language Models

Interactions with virtual assistants typically start with a predefined trigger phrase followed by the user command. To make interactions with the assistant more intuitive, we explore whether it is feasible to drop the requirement that users…

Computation and Language · Computer Science 2024-03-27 Dominik Wagner, Alexander Churchill, Siddharth Sigtia, Panayiotis Georgiou, Matt Mirsamadi, Aarshee Mishra, Erik Marchi

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still…

Sound · Computer Science 2026-04-28 Chengyou Wang, Hongfei Xue, Chunjiang He, Jingbin Hu, Shuiyuan Wang, Bo Wu, Yuyu Ji, Jimeng Zheng, Ruofei Chen, Zhou Zhu, Lei Xie

Hybrid Transformer/CTC Networks for Hardware Efficient Voice Triggering

We consider the design of two-pass voice trigger detection systems. We focus on the networks in the second pass that are used to re-score candidate segments obtained from the first-pass. Our baseline is an acoustic model (AM), with BiLSTM…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-07 Saurabh Adya, Vineet Garg, Siddharth Sigtia, Pramod Simha, Chandra Dhir

Audio-to-Intent Using Acoustic-Textual Subword Representations from End-to-End ASR

Accurate prediction of the user intent to interact with a voice assistant (VA) on a device (e.g. on the phone) is critical for achieving naturalistic, engaging, and privacy-centric interactions with the VA. To this end, we present a novel…

Computation and Language · Computer Science 2022-10-24 Pranay Dighe, Prateeth Nayak, Oggi Rudovic, Erik Marchi, Xiaochuan Niu, Ahmed Tewfik

Internalizing ASR with Implicit Chain of Thought for Efficient Speech-to-Speech Conversational LLM

Current speech-based LLMs are predominantly trained on extensive ASR and TTS datasets, excelling in tasks related to these domains. However, their ability to handle direct speech-to-speech conversations remains notably constrained. These…

Computation and Language · Computer Science 2024-11-05 Robin Shing-Hei Yuen, Timothy Tin-Long Tse, Jian Zhu

Robust Unstructured Knowledge Access in Conversational Dialogue with ASR Errors

Performance of spoken language understanding (SLU) can be degraded with automatic speech recognition (ASR) errors. We propose a novel approach to improve SLU robustness by randomly corrupting clean training text with an ASR error simulator,…

Computation and Language · Computer Science 2022-11-09 Yik-Cheung Tam, Jiacheng Xu, Jiakai Zou, Zecheng Wang, Tinglong Liao, Shuhan Yuan

Improving Voice Trigger Detection with Metric Learning

Voice trigger detection is an important task, which enables activating a voice assistant when a target user speaks a keyword phrase. A detector is typically trained on speech data independent of speaker information and used for the voice…

Sound · Computer Science 2022-09-15 Prateeth Nayak, Takuya Higuchi, Anmol Gupta, Shivesh Ranjan, Stephen Shum, Siddharth Sigtia, Erik Marchi, Varun Lakshminarasimhan, Minsik Cho, Saurabh Adya, Chandra Dhir, Ahmed Tewfik

Streaming, fast and accurate on-device Inverse Text Normalization for Automatic Speech Recognition

Automatic Speech Recognition (ASR) systems typically yield output in lexical form. However, humans prefer a written form output. To bridge this gap, ASR systems usually employ Inverse Text Normalization (ITN). In previous works, Weighted…

Computation and Language · Computer Science 2022-11-08 Yashesh Gaur, Nick Kibre, Jian Xue, Kangyuan Shu, Yuhui Wang, Issac Alphanso, Jinyu Li, Yifan Gong

An ASR Guided Speech Intelligibility Measure for TTS Model Selection

The perceptual quality of neural text-to-speech (TTS) is highly dependent on the choice of the model during training. Selecting the model using a training-objective metric such as the least mean squared error does not always correlate with…

Sound · Computer Science 2020-06-03 Arun Baby, Saranya Vinnaitherthan, Nagaraj Adiga, Pranav Jawale, Sumukh Badam, Sharath Adavanne, Srikanth Konjeti

Progressive Voice Trigger Detection: Accuracy vs Latency

We present an architecture for voice trigger detection for virtual assistants. The main idea in this work is to exploit information in words that immediately follow the trigger phrase. We first demonstrate that by including more audio…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-03 Siddharth Sigtia, John Bridle, Hywel Richards, Pascal Clark, Erik Marchi, Vineet Garg

Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End

Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-22 Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo

Data-selective Transfer Learning for Multi-Domain Speech Recognition

Negative transfer in training of acoustic models for automatic speech recognition has been reported in several contexts such as domain change or speaker characteristics. This paper proposes a novel technique to overcome negative transfer by…

Machine Learning · Computer Science 2015-09-18 Mortaza Doulaty, Oscar Saz, Thomas Hain

Real-time Caller Intent Detection In Human-Human Customer Support Spoken Conversations

Agent assistance during human-human customer support spoken interactions requires triggering workflows based on the caller's intent (reason for call). Timeliness of prediction is essential for a good user experience. The goal is for a…

Artificial Intelligence · Computer Science 2022-08-16 Mrinal Rawat, Victor Barres