Related papers: Y-Vector: Multiscale Waveform Encoder for Speaker …

Online Speaker Adaptation for WaveNet-based Neural Vocoders

In this paper, we propose an online speaker adaptation method for WaveNet-based neural vocoders in order to improve their performance on speaker-independent waveform generation. In this method, a speaker encoder is first constructed using a…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-17 Qiuchen Huang, Yang Ai, Zhenhua Ling

What Does the Speaker Embedding Encode?

Developing a good speaker embedding has received tremendous interest in the speech community, with representations such as i-vector and d-vector demonstrating remarkable performance across various tasks. Despite their widespread adoption, a…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-23 Shuai Wang, Yanmin Qian, Kai Yu

RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification

Recently, direct modeling of raw waveforms using deep neural networks has been widely studied for a number of tasks in audio domains. In speaker verification, however, utilization of raw waveforms is in its preliminary phase, requiring…

Audio and Speech Processing · Electrical Eng. & Systems 2019-07-18 Jee-weon Jung, Hee-Soo Heo, Ju-ho Kim, Hye-jin Shim, Ha-Jin Yu

Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms

Recent advances in deep learning have facilitated the design of speaker verification systems that directly input raw waveforms. For example, RawNet extracts speaker embeddings from raw waveforms, which simplifies the process pipeline and…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-08 Jee-weon Jung, Seung-bin Kim, Hye-jin Shim, Ju-ho Kim, Ha-Jin Yu

Multi-stream Convolutional Neural Network with Frequency Selection for Robust Speaker Verification

Speaker verification aims to verify whether an input speech corresponds to the claimed speaker, and conventionally, this kind of system is deployed based on single-stream scenario, wherein the feature extractor operates in full frequency…

Sound · Computer Science 2025-09-03 Wei Yao, Shen Chen, Jiamin Cui, Yaolin Lou

An Improved Deep Neural Network for Modeling Speaker Characteristics at Different Temporal Scales

This paper presents an improved deep embedding learning method based on convolutional neural network (CNN) for text-independent speaker verification. Two improvements are proposed for x-vector embedding learning: (1) Multi-scale convolution…

Audio and Speech Processing · Electrical Eng. & Systems 2020-01-15 Bin Gu, Wu Guo

On Feature Importance and Interpretability of Speaker Representations

Unsupervised speech disentanglement aims at separating fast varying from slowly varying components of a speech signal. In this contribution, we take a closer look at the embedding vector representing the slowly varying signal components,…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-20 Frederik Rautenberg, Michael Kuhlmann, Jana Wiechmann, Fritz Seebauer, Petra Wagner, Reinhold Haeb-Umbach

T-vectors: Weakly Supervised Speaker Identification Using Hierarchical Transformer Model

Identifying multiple speakers without knowing where a speaker's voice is in a recording is a challenging task. This paper proposes a hierarchical network with transformer encoders and memory mechanism to address this problem. The proposed…

Sound · Computer Science 2020-11-02 Yanpei Shi, Mingjie Chen, Qiang Huang, Thomas Hain

Learning Multiscale Features Directly From Waveforms

Deep learning has dramatically improved the performance of speech recognition systems through learning hierarchies of features optimized for the task at hand. However, true end-to-end learning, where features are learned directly from…

Computation and Language · Computer Science 2016-04-06 Zhenyao Zhu, Jesse H. Engel, Awni Hannun

Speaker Recognition from Raw Waveform with SincNet

Deep learning is progressively gaining popularity as a viable alternative to i-vectors for speaker recognition. Promising results have been recently obtained with Convolutional Neural Networks (CNNs) when fed by raw speech samples directly.…

Audio and Speech Processing · Electrical Eng. & Systems 2019-08-12 Mirco Ravanelli, Yoshua Bengio

Interpreting the Dimensions of Speaker Embedding Space

Speaker embeddings are widely used in speaker verification systems and other applications where it is useful to characterise the voice of a speaker with a fixed-length vector. These embeddings tend to be treated as "black box" encodings,…

Sound · Computer Science 2025-10-21 Mark Huckvale

VoiceVector: Multimodal Enrolment Vectors for Speaker Separation

We present a transformer-based architecture for voice separation of a target speaker from multiple other speakers and ambient noise. We achieve this by using two separate neural networks: (A) An enrolment network designed to craft…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-03 Akam Rahimi, Triantafyllos Afouras, Andrew Zisserman

Probing the Information Encoded in X-vectors

Deep neural network based speaker embeddings, such as x-vectors, have been shown to perform well in text-independent speaker recognition/verification tasks. In this paper, we use simple classifiers to investigate the contents encoded by…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-16 Desh Raj, David Snyder, Daniel Povey, Sanjeev Khudanpur

S-vectors and TESA: Speaker Embeddings and a Speaker Authenticator Based on Transformer Encoder

One of the most popular speaker embeddings is x-vectors, which are obtained from an architecture that gradually builds a larger temporal context with layers. In this paper, we propose to derive speaker embeddings from Transformer's encoder…

Audio and Speech Processing · Electrical Eng. & Systems 2021-12-14 N J Metilda Sagaya Mary, S Umesh, Sandesh V Katta

Improved Source Counting and Separation for Monaural Mixture

Single-channel speech separation in time domain and frequency domain has been widely studied for voice-driven applications over the past few years. Most of previous works assume known number of speakers in advance, however, which is not…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-02 Yiming Xiao, Haijian Zhang

Learning Environmental Sounds with Multi-scale Convolutional Neural Network

Deep learning has dramatically improved the performance of sounds recognition. However, learning acoustic models directly from the raw waveform is still challenging. Current waveform-based models generally use time-domain convolutional…

Sound · Computer Science 2018-03-29 Boqing Zhu, Changjian Wang, Feng Liu, Jin Lei, Zengquan Lu, Yuxing Peng

Speaker Recognition using SincNet and X-Vector Fusion

In this paper, we propose an innovative approach to perform speaker recognition by fusing two recently introduced deep neural networks (DNNs) namely - SincNet and X-Vector. The idea behind using SincNet filters on the raw speech waveform is…

Computation and Language · Computer Science 2020-04-07 Mayank Tripathi, Divyanshu Singh, Seba Susan

Attention-based conditioning methods using variable frame rate for style-robust speaker verification

We propose an approach to extract speaker embeddings that are robust to speaking style variations in text-independent speaker verification. Typically, speaker embedding extraction includes training a DNN for speaker classification and using…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-29 Amber Afshan, Abeer Alwan

FDN: Finite Difference Network with Hierarchical Convolutional Features for Text-independent Speaker Verification

In recent years, using raw waveforms as input for deep networks has been widely explored for the speaker verification system. For example, RawNet and RawNet2 extracted speaker's feature embeddings from waveforms automatically for…

Audio and Speech Processing · Electrical Eng. & Systems 2021-10-08 Jin Li, Nan Yan, Lan Wang

Speaker Verification in Multi-Speaker Environments Using Temporal Feature Fusion

Verifying the identity of a speaker is crucial in modern human-machine interfaces, e.g., to ensure privacy protection or to enable biometric authentication. Classical speaker verification (SV) approaches estimate a fixed-dimensional…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-29 Ahmad Aloradi, Wolfgang Mack, Mohamed Elminshawi, Emanuël A. P. Habets