Related papers: SelfVC: Voice Conversion With Iterative Refinement…

200 papers

Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into an arbitrary unseen one while retaining speech content. Most prior work focuses on preserving the source's prosody, while fine-grained timbre information may…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Jialong Zuo, Shengpeng Ji, Minghui Fang, Mingze Li, Ziyue Jiang, Xize Cheng, Xiaoda Yang, Chen Feiyang, Xinyu Duan, Zhou Zhao

Better disentanglement of speech representations is essential to improving the quality of voice conversion. Contrastive learning based on speaker labels has recently been applied to voice conversion successfully. However, the performance of model…

Sound · Computer Science 2023-11-16 Yimin Deng, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Supervised training of speech recognition models requires access to transcribed audio data, which often is not possible due to confidentiality issues. Our approach to this problem is to generate synthetic audio from a text-only corpus using…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-01 Yanis Perrin, Gilles Boulianne

Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data that are useful for speech recognition. Since these representations are…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-22 Shehzeen Hussain, Van Nguyen, Shuhua Zhang, Erik Visser

The goal of this contribution is to use a parametric speech synthesis system for reducing background noise and other interferences from recorded speech signals. In a first step, Hidden Markov Models of the synthesis system are trained. Two…

Sound · Computer Science 2017-07-06 Daniel Dzibela, Armin Sehr

Conventional singing voice conversion (SVC) methods often struggle to operate on high-resolution audio owing to the high dimensionality of the data. In this paper, we propose a hierarchical representation learning that enables the learning of…

Sound · Computer Science 2021-04-27 Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023. For both in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-10 Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

Conventional methods for speech enhancement rely on handcrafted loss functions (e.g., time or frequency domain losses) or deep feature losses (e.g., using WavLM or wav2vec), which often fail to capture subtle signal properties essential for…

Sound · Computer Science 2025-05-28 Saisamarth Rajesh Phaye, Milos Cernak, Andrew Harper

This paper proposes a zero-shot text-to-speech (TTS) method conditioned on a speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vector or global style…

Sound · Computer Science 2023-12-19 Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

We study the problem of cross-lingual voice conversion with non-parallel speech corpora in a one-shot learning setting. Most prior work requires either parallel speech corpora or a sufficient amount of training data from a target speaker. However, we…

Sound · Computer Science 2018-08-17 Seyed Hamidreza Mohammadi, Taehwan Kim

The goal of voice conversion is to transform source speech into a target voice, keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-09 Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Mathew Baas, Hugo Seuté, Herman Kamper

This paper is about developing personalized speech synthesis systems with recordings of mildly impaired speech. In particular, we consider consonant and vowel alterations resulting from partial glossectomy, the surgical removal of part of…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-10 Yusheng Tian, Guangyan Zhang, Tan Lee

Voice conversion, as a style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC…

Sound · Computer Science 2023-08-23 Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Precise control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. The ability to manipulate parameters like pitch and syllable rate is an important…

Sound · Computer Science 2025-07-08 Mathilde Abrassart, Nicolas Obin, Axel Roebel

By representing speaker characteristics as a single fixed-length vector extracted solely from speech, we can train a neural multi-speaker speech synthesis model by conditioning the model on those vectors. This model can also be adapted to…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-09 Hieu-Thi Luong, Junichi Yamagishi

Recent works on voice conversion (VC) focus on preserving the rhythm and the intonation as well as the linguistic content. To preserve these features from the source, we decompose current non-parallel VC systems into two encoders and one…

Audio and Speech Processing · Electrical Eng. & Systems 2021-10-12 Kang-wook Kim, Seung-won Park, Junhyeok Lee, Myun-chul Joe

Zero-shot voice conversion (VC) aims to transfer the source speaker's timbre to an arbitrary unseen target speaker's timbre while keeping the linguistic content unchanged. Although the voice of generated speech can be controlled by providing the…

Sound · Computer Science 2024-01-31 Junjie Li, Yiwei Guo, Xie Chen, Kai Yu

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-25 Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

We propose TES-VC (Text-driven Environment and Speaker controllable Voice Conversion), a text-driven voice conversion framework with independent control of speaker timbre and environmental acoustics. TES-VC processes simultaneous text…

Sound · Computer Science 2025-06-16 Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu

The imitation of voice, targeting specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data and struggle to disentangle timbre effectively…