Related papers: Fast query-by-example speech search using separabl…

Semantic query-by-example speech search using visual grounding

A number of recent studies have started to investigate how speech systems can be trained on untranscribed speech by leveraging accompanying images at training time. Examples of tasks include keyword prediction and within- and across-mode…

Computation and Language · Computer Science 2019-04-16 Herman Kamper, Aristotelis Anastassiou, Karen Livescu

Acoustic span embeddings for multilingual query-by-example search

Query-by-example (QbE) speech search is the task of matching spoken queries to utterances within a search collection. In low- or zero-resource settings, QbE search is often addressed with approaches based on dynamic time warping (DTW).…

Computation and Language · Computer Science 2020-11-25 Yushi Hu, Shane Settle, Karen Livescu

H-QuEST: Accelerating Query-by-Example Spoken Term Detection with Hierarchical Indexing

Query-by-example spoken term detection (QbE-STD) searches for matching words or phrases in an audio dataset using a sample spoken query. When annotated data is limited or unavailable, QbE-STD is often done using template matching methods…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-23 Akanksha Singh, Yi-Ping Phoebe Chen, Vipul Arora

Cross-Lingual Query-by-Example Spoken Term Detection: A Transformer-Based Approach

Query-by-example spoken term detection (QbE-STD) is typically constrained by transcribed data scarcity and language specificity. This paper introduces a novel, language-agnostic QbE-STD model leveraging image processing techniques and…

Machine Learning · Computer Science 2024-10-08 Allahdadi Fatemeh, Mahdian Toroghi Rahil, Zareian Hassan

Learning Acoustic Word Embeddings with Temporal Context for Query-by-Example Speech Search

We propose to learn acoustic word embeddings with temporal context for query-by-example (QbE) speech search. The temporal context includes the leading and trailing word sequences of a word. We assume that there exist spoken word pairs in…

Computation and Language · Computer Science 2018-06-19 Yougen Yuan, Cheung-Chi Leung, Lei Xie, Hongjie Chen, Bin Ma, Haizhou Li

Query-by-Example Search with Discriminative Neural Acoustic Word Embeddings

Query-by-example search often uses dynamic time warping (DTW) for comparing queries and proposed matching segments. Recent work has shown that comparing speech segments by representing them as fixed-dimensional vectors --- acoustic word…

Computation and Language · Computer Science 2017-06-14 Shane Settle, Keith Levin, Herman Kamper, Karen Livescu

Acoustic Word Embedding System for Code-Switching Query-by-example Spoken Term Detection

In this paper, we propose a deep convolutional neural network-based acoustic word embedding system on code-switching query by example spoken term detection. Different from previous configurations, we combine audio data in two languages for…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-26 Murong Ma, Haiwei Wu, Xuyang Wang, Lin Yang, Junjie Wang, Ming Li

Efficient Speech Quality Assessment using Self-supervised Framewise Embeddings

Automatic speech quality assessment is essential for audio researchers, developers, speech and language pathologists, and system quality engineers. The current state-of-the-art systems are based on framewise speech features (hand-engineered…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-15 Karl El Hajal, Zihan Wu, Neil Scheidwasser-Clow, Gasser Elbanna, Milos Cernak

Neural Network based End-to-End Query by Example Spoken Term Detection

This paper focuses on the problem of query by example spoken term detection (QbE-STD) in zero-resource scenario. State-of-the-art approaches primarily rely on dynamic time warping (DTW) based template matching techniques using phone…

Audio and Speech Processing · Electrical Eng. & Systems 2019-11-20 Dhananjay Ram, Lesly Miculicich, Hervé Bourlard

Whole-Word Segmental Speech Recognition with Acoustic Word Embeddings

Segmental models are sequence prediction models in which scores of hypotheses are based on entire variable-length segments of frames. We consider segmental models for whole-word ("acoustic-to-word") speech recognition, with the feature…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-25 Bowen Shi, Shane Settle, Karen Livescu

Query-by-Example Keyword Spotting Using Spectral-Temporal Graph Attentive Pooling and Multi-Task Learning

Existing keyword spotting (KWS) systems primarily rely on predefined keyword phrases. However, the ability to recognize customized keywords is crucial for tailoring interactions with intelligent devices. In this paper, we present a novel…

Computation and Language · Computer Science 2024-11-26 Zhenyu Wang, Shuyu Kong, Li Wan, Biqiao Zhang, Yiteng Huang, Mumin Jin, Ming Sun, Xin Lei, Zhaojun Yang

Neural approaches to spoken content embedding

Comparing spoken segments is a central operation to speech processing. Traditional approaches in this area have favored frame-level dynamic programming algorithms, such as dynamic time warping, because they require no supervision, but they…

Computation and Language · Computer Science 2023-08-30 Shane Settle

Improving Query-by-Vocal Imitation with Contrastive Learning and Audio Pretraining

Query-by-Vocal Imitation (QBV) is about searching audio files within databases using vocal imitations created by the user's voice. Since most humans can effectively communicate sound concepts through voice, QBV offers the more intuitive and…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-22 Jonathan Greif, Florian Schmid, Paul Primus, Gerhard Widmer

Multilingual Bottleneck Features for Query by Example Spoken Term Detection

State of the art solutions to query by example spoken term detection (QbE-STD) usually rely on bottleneck feature representation of the query and audio document to perform dynamic time warping (DTW) based template matching. Here, we present…

Computation and Language · Computer Science 2019-07-02 Dhananjay Ram, Lesly Miculicich, Hervé Bourlard

Using Word Embeddings for Automatic Query Expansion

In this paper a framework for Automatic Query Expansion (AQE) is proposed using distributed neural language model word2vec. Using semantic and contextual relation in a distributed and unsupervised framework, word2vec learns a low…

Information Retrieval · Computer Science 2016-06-27 Dwaipayan Roy, Debjyoti Paul, Mandar Mitra, Utpal Garain

Multilingual Jointly Trained Acoustic and Written Word Embeddings

Acoustic word embeddings (AWEs) are vector representations of spoken word segments. AWEs can be learned jointly with embeddings of character sequences, to generate phonetically meaningful embeddings of written words, or acoustically…

Computation and Language · Computer Science 2020-06-26 Yushi Hu, Shane Settle, Karen Livescu

Attention-Based Audio Embeddings for Query-by-Example

An ideal audio retrieval system efficiently and robustly recognizes a short query snippet from an extensive database. However, the performance of well-known audio fingerprinting systems falls short at high signal distortion levels. This…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-22 Anup Singh, Kris Demuynck, Vipul Arora

Improving Speech Enhancement via Event-based Query

Existing deep learning based speech enhancement (SE) methods either use blind end-to-end training or explicitly incorporate speaker embedding or phonetic information into the SE network to enhance speech quality. In this paper, we perceive…

Sound · Computer Science 2023-02-27 Yifei Xin, Xiulian Peng, Yan Lu

Informed Source Extraction With Application to Acoustic Echo Reduction

Informed speaker extraction aims to extract a target speech signal from a mixture of sources given prior knowledge about the desired speaker. Recent deep learning-based methods leverage a speaker discriminative model that maps a reference…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-17 Mohamed Elminshawi, Wolfgang Mack, Emanuël A. P. Habets

Improving Acoustic Word Embeddings through Correspondence Training of Self-supervised Speech Representations

Acoustic word embeddings (AWEs) are vector representations of spoken words. An effective method for obtaining AWEs is the Correspondence Auto-Encoder (CAE). In the past, the CAE method has been associated with traditional MFCC features.…

Computation and Language · Computer Science 2024-03-14 Amit Meghanani, Thomas Hain