Learnable MFCCs for Speaker Verification

Sound 2021-02-23 v1 Machine Learning Audio and Speech Processing

Abstract

We propose a learnable mel-frequency cepstral coefficient (MFCC) frontend architecture for deep neural network (DNN) based automatic speaker verification. Our architecture retains the simplicity and interpretability of MFCC-based features while allowing the model to adapt flexibly to data. In practice, we formulate data-driven versions of the four linear transforms of a standard MFCC extractor: windowing, discrete Fourier transform (DFT), mel filterbank, and discrete cosine transform (DCT). The reported results show relative improvements in equal error rate (EER) of up to 6.7% (VoxCeleb1) and 9.7% (SITW) over static MFCCs, without additional tuning effort.
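
To make the "four linear transforms" concrete, the following is a minimal NumPy sketch of a conventional static MFCC extractor written as explicit matrices. It is not the authors' implementation: the function names, the Hamming window, the frame length (400 samples), the 16 kHz sample rate, the 40 filters, and the 20 cepstral coefficients are all assumptions chosen for illustration. One plausible reading of the abstract is that a learnable frontend would treat some or all of these matrices as trainable parameters initialized from their static values; the sketch only builds the static versions and does no training.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular mel filters, shape (n_filters, n_fft // 2 + 1)."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    top_mel = hz_to_mel(sample_rate / 2.0)
    edges = mel_to_hz(np.linspace(0.0, top_mel, n_filters + 2))
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        fbank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fbank

def dct_matrix(n_ceps, n_filters):
    """Unnormalized DCT-II basis, shape (n_ceps, n_filters)."""
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))

def build_static_frontend(n_fft=400, sample_rate=16000, n_filters=40, n_ceps=20):
    """Return the four static transforms (all sizes are illustrative assumptions)."""
    window = np.hamming(n_fft)                      # diagonal of the window matrix
    n_bins = n_fft // 2 + 1
    # One-sided DFT matrix, shape (n_bins, n_fft); equivalent to np.fft.rfft.
    dft = np.exp(-2j * np.pi * np.outer(np.arange(n_bins), np.arange(n_fft)) / n_fft)
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    dct = dct_matrix(n_ceps, n_filters)
    return window, dft, fbank, dct

def mfcc_from_frames(frames, window, dft, fbank, dct, eps=1e-10):
    """frames: (n_frames, n_fft) array of already-framed audio samples."""
    windowed = frames * window                  # 1) windowing
    spectrum = windowed @ dft.T                 # 2) DFT
    power = np.abs(spectrum) ** 2               # nonlinearity: power spectrum
    mel_energies = power @ fbank.T              # 3) mel filterbank
    return np.log(mel_energies + eps) @ dct.T   # nonlinearity: log, then 4) DCT

window, dft, fbank, dct = build_static_frontend()
frames = np.random.default_rng(0).standard_normal((5, 400))  # stand-in for real audio
features = mfcc_from_frames(frames, window, dft, fbank, dct)
print(features.shape)
```

Writing the extractor this way shows that only the magnitude-squared and logarithm steps are nonlinear; everything else is a matrix product, which is what makes replacing a fixed matrix with a data-driven one straightforward in a DNN framework.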

Cite

@article{arxiv.2102.10322,
  title  = {Learnable MFCCs for Speaker Verification},
  author = {Xuechen Liu and Md Sahidullah and Tomi Kinnunen},
  journal = {arXiv preprint arXiv:2102.10322},
  year   = {2021}
}

Comments

Accepted to ISCAS 2021