Related papers: Unsupervised Speech Representation Pooling Using V…

200 papers

Pooling is an essential component of a wide variety of sentence representation and embedding models. This paper explores generalized pooling methods to enhance sentence embedding. We propose vector-based multi-head attention that includes…

Computation and Language · Computer Science 2022-02-24 Qian Chen , Zhen-Hua Ling , Xiaodan Zhu

The pooling layer is an essential component in the neural network based speaker verification. Most of the current networks in speaker verification use average pooling to derive the utterance-level speaker representations. Average pooling…

Sound · Computer Science 2018-08-23 Yi Liu , Liang He , Weiwei Liu , Jia Liu

Recently, there have been efforts to encode the linguistic information of speech using a self-supervised framework for speech synthesis. However, predicting representations from surrounding representations can inadvertently entangle speaker…

Sound · Computer Science 2024-04-02 Injune Hwang , Kyogu Lee

Self-supervised learning of speech representations from large amounts of unlabeled data has enabled state-of-the-art results in several speech processing tasks. Aggregating these speech representations across time is typically approached by…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-19 Themos Stafylakis , Ladislav Mosner , Sofoklis Kakouros , Oldrich Plchot , Lukas Burget , Jan Cernocky

Unsupervised representation learning for speech processing has matured greatly in the last few years. Work in computer vision and natural language processing has paved the way, but speech data offers unique challenges. As a result, methods…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-04 Lasse Borgholt , Jakob Drachmann Havtorn , Joakim Edin , Lars Maaløe , Christian Igel

This paper proposes attentive statistics pooling for deep speaker embedding in text-independent speaker verification. In conventional speaker embedding, frame-level features are averaged over all the frames of a single utterance to form an…

Audio and Speech Processing · Electrical Eng. & Systems 2019-02-27 Koji Okabe , Takafumi Koshinaka , Koichi Shinoda

This paper proposes a method for extracting speaker embedding for each speaker from a variable-length recording containing multiple speakers. Speaker embeddings are crucial not only for speaker recognition but also for various multi-speaker…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-02 Shota Horiguchi , Atsushi Ando , Takafumi Moriya , Takanori Ashihara , Hiroshi Sato , Naohiro Tawara , Marc Delcroix

Speech signals are complex composites of various information, including phonetic content, speaker traits, channel effect, etc. Decomposing this complicated mixture into independent factors, i.e., speech factorization, is fundamentally…

Sound · Computer Science 2019-10-30 Haoran Sun , Yunqi Cai , Lantian Li , Dong Wang

This paper proposes a novel unsupervised autoregressive neural model for learning generic speech representations. In contrast to other speech representation learning methods that aim to remove noise or speaker variabilities, ours is…

Computation and Language · Computer Science 2019-06-20 Yu-An Chung , Wei-Ning Hsu , Hao Tang , James Glass

Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this to dialects and…

Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing…

Computation and Language · Computer Science 2023-03-16 Ramon Sanabria , Hao Tang , Sharon Goldwater

Existing studies on self-supervised speech representation learning have focused on developing new training methods and applying pre-trained models for different applications. However, the quality of these models is often measured by the…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-18 Alexander H. Liu , Sung-Lin Yeh , James Glass

The recently proposed self-attentive pooling (SAP) has shown good performance in several speaker recognition systems. In SAP systems, the context vector is trained end-to-end together with the feature extractor, where the role of context…

Sound · Computer Science 2020-12-04 Seong Min Kye , Joon Son Chung , Hoirin Kim

With the popularity of virtual assistants (e.g., Siri, Alexa), the use of speech recognition is now becoming more and more widespread. However, speech signals contain a lot of sensitive information, such as the speaker's identity, which…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-21 Pierre Champion , Denis Jouvet , Anthony Larcher

State-of-the-art Deep Learning systems for speaker verification are commonly based on speaker embedding extractors. These architectures are usually composed of a feature extractor front-end together with a pooling layer to encode…

Audio and Speech Processing · Electrical Eng. & Systems 2024-05-08 Federico Costa , Miquel India , Javier Hernando

Iterative self-training, or iterative pseudo-labeling (IPL) -- using an improved model from the current iteration to provide pseudo-labels for the next iteration -- has proven to be a powerful approach to enhance the quality of speaker…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-22 Zakaria Aldeneh , Takuya Higuchi , Jee-weon Jung , Li-Wei Chen , Stephen Shum , Ahmed Hussen Abdelaziz , Shinji Watanabe , Tatiana Likhomanenko , Barry-John Theobald

Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust…

Voice anti-spoofing aims at classifying a given utterance either as a bonafide human sample, or a spoofing attack (e.g. synthetic or replayed sample). Many anti-spoofing methods have been proposed but most of them fail to generalize across…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-23 Bhusan Chettri , Rosa González Hautamäki , Md Sahidullah , Tomi Kinnunen

In the domain of unsupervised learning most work on speech has focused on discovering low-level constructs such as phoneme inventories or word-like units. In contrast, for written language, where there is a large body of work on…

Computation and Language · Computer Science 2018-10-29 Grzegorz Chrupała , Lieke Gelderloos , Ákos Kádár , Afra Alishahi

Recent speaker verification studies have achieved notable success by leveraging layer-wise output from pre-trained Transformer models. However, few have explored the advancements in aggregating these multi-level features beyond the static…

Sound · Computer Science 2025-12-30 Jin Sob Kim , Hyun Joon Park , Wooseok Shin , Sung Won Han