Related papers: Mel-spectrogram augmentation for sequence to seque…

200 papers

In this paper, a neural network named Sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At training stage, a SCENT model is estimated by aligning the feature sequences of source and…

Sound · Computer Science 2020-01-14 Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, Li-Rong Dai

Acoustic features play an important role in improving the quality of the synthesised speech. Currently, the Mel spectrogram is a widely employed acoustic feature in most acoustic models. However, due to the fine-grained loss caused by its…

Sound · Computer Science 2024-07-11 Guoqiang Hu, Huaning Tan, Ruilai Li

This paper presents methods of making use of text supervision to improve the performance of sequence-to-sequence (seq2seq) voice conversion. Compared with conventional frame-to-frame voice conversion approaches, the seq2seq acoustic…

Sound · Computer Science 2020-01-14 Jing-Xuan Zhang, Zhen-Hua Ling, Yuan Jiang, Li-Juan Liu, Chen Liang, Li-Rong Dai

Non-parallel voice conversion (VC) is a technique for learning mappings between source and target speeches without using a parallel corpus. Recently, cycle-consistent adversarial network (CycleGAN)-VC and CycleGAN-VC2 have shown promising…

Sound · Computer Science 2020-10-23 Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

In this work, we propose Mel-FullSubNet, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. Mel-FullSubNet takes as input the noisy…

Audio and Speech Processing · Electrical Eng. & Systems 2024-02-23 Rui Zhou, Xian Li, Ying Fang, Xiaofei Li

This paper introduces a cepstrum-based pitch modification method that can be applied to any mel-spectrogram representation. As a result, this method is compatible with any mel-based vocoder without requiring any additional training or…

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior…

Sound · Computer Science 2017-08-08 Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

In speech synthesis and speech enhancement systems, mel-spectrograms need to be precise acoustic representations. However, the generated spectrograms are over-smoothed and cannot produce high-quality synthesized speech. Inspired by…

Audio and Speech Processing · Electrical Eng. & Systems 2019-12-04 Leyuan Sheng, Dong-Yan Huang, Evgeniy N. Pavlovskiy

In this work, we propose CleanMel, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. The proposed network takes as input the noisy…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-31 Nian Shao, Rui Zhou, Pengyu Wang, Xian Li, Ying Fang, Yujie Yang, Xiaofei Li

When convolutional neural networks are used to tackle learning problems based on music or, more generally, time series data, raw one-dimensional data are commonly pre-processed to obtain spectrogram or mel-spectrogram coefficients, which…

Machine Learning · Computer Science 2018-09-20 Monika Doerfler, Thomas Grill, Roswitha Bammer, Arthur Flexer

Speech Emotion Recognition (SER) affective technology enables intelligent embedded devices to interact with sensitivity. Similarly, call centre employees recognise customers' emotions from their pitch, energy, and tone of voice so as to…

Sound · Computer Science 2023-12-19 David Hason Rudd, Huan Huo, Guandong Xu

Non-parallel voice conversion (VC) is a technique for training voice converters without a parallel corpus. Cycle-consistent adversarial network-based VCs (CycleGAN-VC and CycleGAN-VC2) are widely accepted as benchmark methods. However,…

Sound · Computer Science 2021-02-26 Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Nobukatsu Hojo

Consumer-grade music recordings such as those captured by mobile devices typically contain distortions in the form of background noise, reverb, and microphone-induced EQ. This paper presents a deep learning approach to enhance low-quality…

Sound · Computer Science 2022-04-29 Nikhil Kandpal, Oriol Nieto, Zeyu Jin

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks.…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-29 Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

Previously, we introduced VoiceGrad, a nonparallel voice conversion (VC) technique enabling mel-spectrogram conversion from source to target speakers using a score-based diffusion model. The concept involves training a score network to…

Sound · Computer Science 2025-09-11 Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Yuto Kondo

Sequence-to-sequence (seq2seq) voice conversion (VC) models have greater potential in converting electrolaryngeal (EL) speech to normal speech (EL2SP) compared to conventional VC models. However, EL2SP based on seq2seq VC requires a…

Sound · Computer Science 2022-10-20 Ding Ma, Lester Phillip Violeta, Kazuhiro Kobayashi, Tomoki Toda

Extracting features from the speech is the most critical process in speech signal processing. Mel Frequency Cepstral Coefficients (MFCC) are the most widely used features in the majority of the speaker and speech recognition applications,…

Sound · Computer Science 2025-10-31 Rinku Sebastian, Simon O'Keefe, Martin Trefzer

Most recent speech synthesis systems are composed of a synthesizer and a vocoder. However, the existing synthesizers and vocoders can only be matched to acoustic features extracted with a specific configuration. Hence, we can't combine…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-01 Fan-Lin Wang, Po-chun Hsu, Da-rong Liu, Hung-yi Lee

Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform. Modern speech generation pipelines use a vocoder as their final component. Recent…

Sound · Computer Science 2022-08-29 Bruno Di Giorgi, Mark Levy, Richard Sharp

A mixed sample data augmentation strategy is proposed to enhance the performance of models on audio scene classification, sound event classification, and speech enhancement tasks. While there have been several augmentation methods shown to…

Sound · Computer Science 2021-08-09 Gwantae Kim, David K. Han, Hanseok Ko