Related papers: Speech Enhancement with Multi-granularity Vector Q…

Speech Enhancement Using Self-Supervised Pre-Trained Model and Vector Quantization

With the development of deep learning, neural network-based speech enhancement (SE) models have shown excellent performance. Meanwhile, it was shown that the development of self-supervised pre-trained models can be applied to various…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-09-29 · Xiao-Ying Zhao, Qiu-Shi Zhu, Jie Zhang

Self-Supervised Speech Quality Estimation and Enhancement Using Only Clean Speech

Speech quality estimation has recently undergone a paradigm shift from human-hearing expert designs to machine-learning models. However, current models rely mainly on supervised learning, which is time-consuming and expensive for label…

Sound · Computer Science · 2024-02-27 · Szu-Wei Fu, Kuo-Hsuan Hung, Yu Tsao, Yu-Chiang Frank Wang

Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders

Recent research has delved into speech enhancement (SE) approaches that leverage audio embeddings from pre-trained models, diverging from time-frequency masking or signal prediction techniques. This paper introduces an efficient and…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-06-16 · Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan

Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge

In this paper, we explore vector quantization for acoustic unit discovery. Leveraging unlabelled data, we aim to learn discrete representations of speech that separate phonetic content from speaker-specific details. We propose two neural…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-08-20 · Benjamin van Niekerk, Leanne Nortje, Herman Kamper

VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention

The deep learning-based speech enhancement (SE) methods always take the clean speech's waveform or time-frequency spectrum feature as the learning target, and train the deep neural network (DNN) by reducing the error loss between the DNN's…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-11-02 · Yuewei Zhang, Huanbin Zou, Jie Zhu

VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture

Voice conversion (VC) is a task that transforms the source speaker's timbre, accent, and tones in audio into another one's while preserving the linguistic content. It is still a challenging work, especially in a one-shot setting.…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-06-09 · Da-Yi Wu, Yen-Hao Chen, Hung-Yi Lee

Increasing Compactness Of Deep Learning Based Speech Enhancement Models With Parameter Pruning And Quantization Techniques

Most recent studies on deep learning based speech enhancement (SE) focused on improving denoising performance. However, successful SE applications require striking a desirable balance between denoising performance and computational cost in…

Audio and Speech Processing · Electrical Eng. & Systems · 2020-01-08 · Jyun-Yi Wu, Cheng Yu, Szu-Wei Fu, Chih-Ting Liu, Shao-Yi Chien, Yu Tsao

Self-Supervised Learning for Speech Enhancement through Synthesis

Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency masking, latent representation masking, or discriminative signal prediction. In contrast, some recent works explore SE via generative…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-11-07 · Bryce Irvin, Marko Stamenovic, Mikolaj Kegler, Li-Chia Yang

Can We Trust Deep Speech Prior?

Recently, speech enhancement (SE) based on deep speech prior has attracted much attention, such as the variational auto-encoder with non-negative matrix factorization (VAE-NMF) architecture. Compared to conventional approaches that…

Sound · Computer Science · 2020-11-05 · Ying Shi, Haolin Chen, Zhiyuan Tang, Lantian Li, Dong Wang, Jiqing Han

VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion

One-shot voice conversion (VC), which performs conversion across arbitrary speakers with only a single target-speaker utterance for reference, can be effectively achieved by speech representation disentanglement. Existing work generally…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-07-22 · Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng

Causal Speech Enhancement with Predicting Semantics based on Quantized Self-supervised Learning Features

Real-time speech enhancement (SE) is essential to online speech communication. Causal SE models use only the previous context while predicting future information, such as phoneme continuation, may help performing causal SE. The phonetic…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-12-30 · Emiru Tsunoo, Yuki Saito, Wataru Nakata, Hiroshi Saruwatari

Speech Tokenizer is Key to Consistent Representation

Speech tokenization is crucial in digital speech processing, converting continuous speech signals into discrete units for various computational tasks. This paper introduces a novel speech tokenizer with broad applicability across downstream…

Machine Learning · Computer Science · 2025-07-10 · Wonjin Jung, Sungil Kang, Dong-Yeon Cho

Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent…

Sound · Computer Science · 2022-04-19 · Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang

Audio-Visual Speech Enhancement Using Multimodal Deep Convolutional Neural Networks

Speech enhancement (SE) aims to reduce noise in speech signals. Most SE techniques focus only on addressing audio information. In this work, inspired by multimodal learning, which utilizes data from different modalities, and the recent…

Sound · Computer Science · 2018-01-25 · Jen-Cheng Hou, Syu-Siang Wang, Ying-Hui Lai, Yu Tsao, Hsiu-Wen Chang, Hsin-Min Wang

A vector quantized masked autoencoder for speech emotion recognition

Recent years have seen remarkable progress in speech emotion recognition (SER), thanks to advances in deep learning techniques. However, the limited availability of labeled data remains a significant challenge in the field. Self-supervised…

Sound · Computer Science · 2023-04-24 · Samir Sadok, Simon Leglaive, Renaud Séguier

Feature Normalization for Fine-tuning Self-Supervised Models in Speech Enhancement

Large, pre-trained representation models trained using self-supervised learning have gained popularity in various fields of machine learning because they are able to extract high-quality salient features from input data. As such, they have…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-06-16 · Hejung Yang, Hong-Goo Kang

ParaGSE: Parallel Generative Speech Enhancement with Group-Vector-Quantization-based Neural Speech Codec

Recently, generative speech enhancement has garnered considerable interest; however, existing approaches are hindered by excessive complexity, limited efficiency, and suboptimal speech quality. To overcome these challenges, this paper…

Sound · Computer Science · 2026-02-03 · Fei Liu, Yang Ai

Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR

Speech enhancement (SE) is usually required as a front end to improve the speech quality in noisy environments, while the enhanced speech might not be optimal for automatic speech recognition (ASR) systems due to speech distortion. On the…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-05-27 · Qiu-Shi Zhu, Jie Zhang, Zi-Qiang Zhang, Li-Rong Dai

Mitigating Premature Discretization with Progressive Quantization for Robust Vector Tokenization

Vector Quantization (VQ) has become the cornerstone of tokenization for many multimodal Large Language Models and diffusion synthesis. However, existing VQ paradigms suffer from a fundamental conflict: they enforce discretization before the…

Machine Learning · Computer Science · 2026-03-25 · Wenhao Zhao, Qiran Zou, Zhouhan Lin, Dianbo Liu

Cross-Scale Vector Quantization for Scalable Neural Speech Coding

Bitrate scalability is a desirable feature for audio coding in real-time communications. Existing neural audio codecs usually enforce a specific bitrate during training, so different models need to be trained for each target bitrate, which…

Sound · Computer Science · 2022-07-08 · Xue Jiang, Xiulian Peng, Huaying Xue, Yuan Zhang, Yan Lu