Related papers: Unsupervised Speech Representation Pooling Using V…

Enhancing Sentence Embedding with Generalized Pooling

Pooling is an essential component of a wide variety of sentence representation and embedding models. This paper explores generalized pooling methods to enhance sentence embedding. We propose vector-based multi-head attention that includes…

Computation and Language · Computer Science 2022-02-24 Qian Chen, Zhen-Hua Ling, Xiaodan Zhu

Exploring a Unified Attention-Based Pooling Framework for Speaker Verification

The pooling layer is an essential component in the neural network based speaker verification. Most of the current networks in speaker verification use average pooling to derive the utterance-level speaker representations. Average pooling…

Sound · Computer Science 2018-08-23 Yi Liu, Liang He, Weiwei Liu, Jia Liu

Removing Speaker Information from Speech Representation using Variable-Length Soft Pooling

Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker…

Sound · Computer Science 2024-04-02 Injune Hwang, Kyogu Lee

Extracting speaker and emotion information from self-supervised speech models via channel-wise correlations

Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-19 Themos Stafylakis, Ladislav Mosner, Sofoklis Kakouros, Oldrich Plchot, Lukas Burget, Jan Cernocky

A Brief Overview of Unsupervised Neural Speech Representation Learning

Unsupervised representation learning for speech processing has matured greatly in the last few years. Work in computer vision and natural language processing has paved the way, but speech data offers unique challenges. As a result, methods…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-04 Lasse Borgholt, Jakob Drachmann Havtorn, Joakim Edin, Lars Maaløe, Christian Igel

Attentive Statistics Pooling for Deep Speaker Embedding

This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an…

Audio and Speech Processing · Electrical Eng. & Systems 2019-02-27 Koji Okabe, Takafumi Koshinaka, Koichi Shinoda

Recursive Attentive Pooling for Extracting Speaker Embeddings from Multi-Speaker Recordings

This paper proposes a method for extracting speaker embedding for each speaker from a variable-length recording containing multiple speakers. Speaker embeddings are crucial not only for speaker recognition but also for various multi-speaker…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-02 Shota Horiguchi, Atsushi Ando, Takafumi Moriya, Takanori Ashihara, Hiroshi Sato, Naohiro Tawara, Marc Delcroix

On Investigation of Unsupervised Speech Factorization Based on Normalization Flow

Speech signals are complex composites of various information, including phonetic content, speaker traits, channel effect, etc. Decomposing this complicated mixture into independent factors, i.e., speech factorization, is fundamentally…

Sound · Computer Science 2019-10-30 Haoran Sun, Yunqi Cai, Lantian Li, Dong Wang

An Unsupervised Autoregressive Model for Speech Representation Learning

This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations. In contrast to other speech representation learning methods that aim to remove noise or speaker variabilities, ours is…

Computation and Language · Computer Science 2019-06-20 Yu-An Chung, Wei-Ning Hsu, Hao Tang, James Glass

Self-Supervised Speech Representation Learning: A Review

Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and…

Computation and Language · Computer Science 2022-11-23 Abdelrahman Mohamed, Hung-yi Lee, Lasse Borgholt, Jakob D. Havtorn, Joakim Edin, Christian Igel, Katrin Kirchhoff, Shang-Wen Li, Karen Livescu, Lars Maaløe, Tara N. Sainath, Shinji Watanabe

Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models

Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing…

Computation and Language · Computer Science 2023-03-16 Ramon Sanabria, Hao Tang, Sharon Goldwater

Revisiting Self-supervised Learning of Speech Representation from a Mutual Information Perspective

Existing studies on self-supervised speech representation learning have focused on developing new training methods and applying pre-trained models for different applications. However, the quality of these models is often measured by the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-18 Alexander H. Liu, Sung-Lin Yeh, James Glass

Supervised attention for speaker recognition

The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, where the role of context…

Sound · Computer Science 2020-12-04 Seong Min Kye, Joon Son Chung, Hoirin Kim

Privacy-Preserving Speech Representation Learning using Vector Quantization

With the popularity of virtual assistants (e.g., Siri, Alexa), the use of speech recognition is now becoming more and more widespread. However, speech signals contain a lot of sensitive information, such as the speaker's identity, which…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-21 Pierre Champion, Denis Jouvet, Anthony Larcher

Speaker Characterization by means of Attention Pooling

State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode…

Audio and Speech Processing · Electrical Eng. & Systems 2024-05-08 Federico Costa, Miquel India, Javier Hernando

Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

Iterative self-training, or iterative pseudo-labeling (IPL) -- using an improved model from the current iteration to provide pseudo-labels for the next iteration -- has proven to be a powerful approach to enhance the quality of speaker…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-22 Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald

Learning An Invariant Speech Representation

Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust…

Sound · Computer Science 2014-06-17 Georgios Evangelopoulos, Stephen Voinea, Chiyuan Zhang, Lorenzo Rosasco, Tomaso Poggio

Data Quality as Predictor of Voice Anti-Spoofing Generalization

Voice anti-spoofing aims at classifying a given utterance either as a bonafide human sample, or a spoofing attack (e.g. synthetic or replayed sample). Many anti-spoofing methods have been proposed but most of them fail to generalize across…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-23 Bhusan Chettri, Rosa González Hautamäki, Md Sahidullah, Tomi Kinnunen

On the difficulty of a distributional semantics of spoken language

In the domain of unsupervised learning most work on speech has focused on discovering low-level constructs such as phoneme inventories or word-like units. In contrast, for written language, where there is a large body of work on…

Computation and Language · Computer Science 2018-10-29 Grzegorz Chrupała, Lieke Gelderloos, Ákos Kádár, Afra Alishahi

Rethinking Leveraging Pre-Trained Multi-Layer Representations for Speaker Verification

Recent speaker verification studies have achieved notable success by leveraging layer-wise output from pre-trained Transformer models. However, few have explored the advancements in aggregating these multi-level features beyond the static…

Sound · Computer Science 2025-12-30 Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han