Related papers: Mel-spectrogram augmentation for sequence to seque…

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

In this paper, a neural network named Sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At the training stage, a SCENT model is estimated by aligning the feature sequences of source and…

Sound · Computer Science 2020-01-14 Jing-Xuan Zhang , Zhen-Hua Ling , Li-Juan Liu , Yuan Jiang , Li-Rong Dai

A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis

Acoustic features play an important role in improving the quality of the synthesised speech. Currently, the Mel spectrogram is a widely employed acoustic feature in most acoustic models. However, due to the fine-grained loss caused by its…

Sound · Computer Science 2024-07-11 Guoqiang Hu , Huaning Tan , Ruilai Li

Improving Sequence-to-Sequence Acoustic Modeling by Adding Text-Supervision

This paper presents methods of making use of text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic…

Sound · Computer Science 2020-01-14 Jing-Xuan Zhang , Zhen-Hua Ling , Yuan Jiang , Li-Juan Liu , Chen Liang , Li-Rong Dai

CycleGAN-VC3: Examining and Improving CycleGAN-VCs for Mel-spectrogram Conversion

Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising…

Sound · Computer Science 2020-10-23 Takuhiro Kaneko , Hirokazu Kameoka , Kou Tanaka , Nobukatsu Hojo

Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

In this work, we propose Mel-FullSubNet, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. Mel-FullSubNet takes as input the noisy…

Audio and Speech Processing · Electrical Eng. & Systems 2024-02-23 Rui Zhou , Xian Li , Ying Fang , Xiaofei Li

Pseudo-Cepstrum: Pitch Modification for Mel-Based Neural Vocoders

This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, this method is compatible with any mel-based vocoder without requiring any additional training or…

Sound · Computer Science 2025-12-19 Nikolaos Ellinas , Alexandra Vioni , Panos Kakoulidis , Georgios Vamvoukakis , Myrsini Christidou , Konstantinos Markopoulos , Junkwang Oh , Gunu Jho , Inchul Hwang , Aimilios Chalamandaris , Pirros Tsiakoulis

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior…

Sound · Computer Science 2017-08-08 Hiroyuki Miyoshi , Yuki Saito , Shinnosuke Takamichi , Hiroshi Saruwatari

High-quality Speech Synthesis Using Super-resolution Mel-Spectrogram

In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations. However, the generated spectrograms are over-smooth and thus cannot produce high-quality synthesized speech. Inspired by…

Audio and Speech Processing · Electrical Eng. & Systems 2019-12-04 Leyuan Sheng , Dong-Yan Huang , Evgeniy N. Pavlovskiy

CleanMel: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR

In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-31 Nian Shao , Rui Zhou , Pengyu Wang , Xian Li , Ying Fang , Yujie Yang , Xiaofei Li

Basic Filters for Convolutional Neural Networks Applied to Music: Training or Design?

When convolutional neural networks are used to tackle learning problems based on music or, more generally, time series data, raw one-dimensional data are commonly pre-processed to obtain spectrogram or mel-spectrogram coefficients, which…

Machine Learning · Computer Science 2018-09-20 Monika Doerfler , Thomas Grill , Roswitha Bammer , Arthur Flexer

Leveraged Mel spectrograms using Harmonic and Percussive Components in Speech Emotion Recognition

Speech Emotion Recognition (SER) affective technology enables intelligent embedded devices to interact with sensitivity. Similarly, call centre employees recognise customers' emotions from their pitch, energy, and tone of voice so as to…

Sound · Computer Science 2023-12-19 David Hason Rudd , Huan Huo , Guandong Xu

MaskCycleGAN-VC: Learning Non-parallel Voice Conversion with Filling in Frames

Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However,…

Sound · Computer Science 2021-02-26 Takuhiro Kaneko , Hirokazu Kameoka , Kou Tanaka , Nobukatsu Hojo

Music Enhancement via Image Translation and Vocoding

Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhance low-quality…

Sound · Computer Science 2022-04-29 Nikhil Kandpal , Oriol Nieto , Zeyu Jin

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks.…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-29 Sung-Lin Yeh , Wei Zhou , Gil Keren , Duc Le , Zhong Meng , Hao Tang , Jay Mahadeokar , Ozlem Kalinli , Alexandre Mourachko

LatentVoiceGrad: Nonparallel Voice Conversion with Latent Diffusion/Flow-Matching Models

Previously, we introduced VoiceGrad, a nonparallel voice conversion (VC) technique enabling mel-spectrogram conversion from source to target speakers using a score-based diffusion model. The concept involves training a score network to…

Sound · Computer Science 2025-09-11 Hirokazu Kameoka , Takuhiro Kaneko , Kou Tanaka , Yuto Kondo

Two-stage training method for Japanese electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion

Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential in converting electrolaryngeal (EL) speech to normal speech (EL2SP) compared to conventional VC models. However, EL2SP based on seq2seq VC requires a…

Sound · Computer Science 2022-10-20 Ding Ma , Lester Phillip Violeta , Kazuhiro Kobayashi , Tomoki Toda

Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient

Extracting features from the speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of the speaker and speech recognition applications,…

Sound · Computer Science 2025-10-31 Rinku Sebastian , Simon O'Keefe , Martin Trefzer

Universal Adaptor: Converting Mel-Spectrograms Between Different Configurations for Speech Synthesis

Most recent speech synthesis systems are composed of a synthesizer and a vocoder. However, the existing synthesizers and vocoders can only be matched to acoustic features extracted with a specific configuration. Hence, we can't combine…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-01 Fan-Lin Wang , Po-chun Hsu , Da-rong Liu , Hung-yi Lee

Mel Spectrogram Inversion with Stable Pitch

Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent…

Sound · Computer Science 2022-08-29 Bruno Di Giorgi , Mark Levy , Richard Sharp

SpecMix: A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features

A mixed sample data augmentation strategy is proposed to enhance the performance of models on audio scene classification, sound event classification, and speech enhancement tasks. While there have been several augmentation methods shown to…

Sound · Computer Science 2021-08-09 Gwantae Kim , David K. Han , Hanseok Ko