Related papers: SelfVC: Voice Conversion With Iterative Refinement…

Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching

Zero-Shot Voice Conversion (VC) aims to transform the source speaker's timbre into an arbitrary unseen one while retaining speech content. Most prior work focuses on preserving the source's prosody, while fine-grained timbre information may…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Jialong Zuo, Shengpeng Ji, Minghui Fang, Mingze Li, Ziyue Jiang, Xize Cheng, Xiaoda Yang, Chen Feiyang, Xinyu Duan, Zhou Zhao

CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation

Better disentanglement of speech representations is essential to improving the quality of voice conversion. Recently, contrastive learning has been applied successfully to voice conversion based on speaker labels. However, the performance of model…

Sound · Computer Science 2023-11-16 Yimin Deng, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Towards Improved Speech Recognition through Optimized Synthetic Data Generation

Supervised training of speech recognition models requires access to transcribed audio data, which often is not possible due to confidentiality issues. Our approach to this problem is to generate synthetic audio from a text-only corpus using…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-01 Yanis Perrin, Gilles Boulianne

Multi-task Voice Activated Framework using Self-supervised Learning

Self-supervised learning methods such as wav2vec 2.0 have shown promising results in learning speech representations from unlabelled and untranscribed speech data that are useful for speech recognition. Since these representations are…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-22 Shehzeen Hussain, Van Nguyen, Shuhua Zhang, Erik Visser

Hidden-Markov-Model Based Speech Enhancement

The goal of this contribution is to use a parametric speech synthesis system for reducing background noise and other interferences from recorded speech signals. In a first step, Hidden Markov Models of the synthesis system are trained. Two…

Sound · Computer Science 2017-07-06 Daniel Dzibela, Armin Sehr

Hierarchical disentangled representation learning for singing voice conversion

Conventional singing voice conversion (SVC) methods often suffer from operating in high-resolution audio owing to a high dimensionality of data. In this paper, we propose a hierarchical representation learning that enables the learning of…

Sound · Computer Science 2021-04-27 Naoya Takahashi, Mayank Kumar Singh, Yuki Mitsufuji

A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

This paper presents our systems (denoted as T13) for the singing voice conversion challenge (SVCC) 2023. For both in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-10 Ryuichi Yamamoto, Reo Yoneyama, Lester Phillip Violeta, Wen-Chin Huang, Tomoki Toda

Model as Loss: A Self-Consistent Training Paradigm

Conventional methods for speech enhancement rely on handcrafted loss functions (e.g., time or frequency domain losses) or deep feature losses (e.g., using WavLM or wav2vec), which often fail to capture subtle signal properties essential for…

Sound · Computer Science 2025-05-28 Saisamarth Rajesh Phaye, Milos Cernak, Andrew Harper

Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model

This paper proposes a zero-shot text-to-speech (TTS) conditioned by a self-supervised speech-representation model acquired through self-supervised learning (SSL). Conventional methods with embedding vectors from x-vector or global style…

Sound · Computer Science 2023-12-19 Kenichi Fujita, Takanori Ashihara, Hiroki Kanagawa, Takafumi Moriya, Yusuke Ijima

Investigation of Using Disentangled and Interpretable Representations for One-shot Cross-lingual Voice Conversion

We study the problem of cross-lingual voice conversion with non-parallel speech corpora in a one-shot learning setting. Most prior work requires either parallel speech corpora or a sufficient amount of training data from a target speaker. However, we…

Sound · Computer Science 2018-08-17 Seyed Hamidreza Mohammadi, Taehwan Kim

A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion

The goal of voice conversion is to transform source speech into a target voice, keeping the content unchanged. In this paper, we focus on self-supervised representation learning for voice conversion. Specifically, we compare discrete and…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-09 Benjamin van Niekerk, Marc-André Carbonneau, Julian Zaïdi, Mathew Baas, Hugo Seuté, Herman Kamper

Creating Personalized Synthetic Voices from Post-Glossectomy Speech with Guided Diffusion Models

This paper is about developing personalized speech synthesis systems with recordings of mildly impaired speech. In particular, we consider consonant and vowel alterations resulting from partial glossectomy, the surgical removal of part of…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-10 Yusheng Tian, Guangyan Zhang, Tan Lee

PMVC: Data Augmentation-Based Prosody Modeling for Expressive Voice Conversion

Voice conversion, as the style transfer task applied to speech, refers to converting one person's speech into new speech that sounds like another person's. Up to now, there has been a lot of research devoted to better implementation of VC…

Sound · Computer Science 2023-08-23 Yimin Deng, Huaizhen Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, Jing Xiao

Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters

Precise control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. The ability to manipulate parameters like pitch and syllable rate is an important…

Sound · Computer Science 2025-07-08 Mathilde Abrassart, Nicolas Obin, Axel Roebel

A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation

By representing speaker characteristic as a single fixed-length vector extracted solely from speech, we can train a neural multi-speaker speech synthesis model by conditioning the model on those vectors. This model can also be adapted to…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-09 Hieu-Thi Luong, Junichi Yamagishi

Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques

Recent works on voice conversion (VC) focus on preserving the rhythm and the intonation as well as the linguistic content. To preserve these features from the source, we decompose current non-parallel VC systems into two encoders and one…

Audio and Speech Processing · Electrical Eng. & Systems 2021-10-12 Kang-wook Kim, Seung-won Park, Junhyeok Lee, Myun-chul Joe

SEF-VC: Speaker Embedding Free Zero-Shot Voice Conversion with Cross Attention

Zero-shot voice conversion (VC) aims to transfer the source speaker timbre to arbitrary unseen target speaker timbre, while keeping the linguistic content unchanged. Although the voice of generated speech can be controlled by providing the…

Sound · Computer Science 2024-01-31 Junjie Li, Yiwei Guo, Xie Chen, Kai Yu

Wav2vec-C: A Self-supervised Model for Speech Representation Learning

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-25 Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

In This Environment, As That Speaker: A Text-Driven Framework for Multi-Attribute Speech Conversion

We propose TES-VC (Text-driven Environment and Speaker controllable Voice Conversion), a text-driven voice conversion framework with independent control of speaker timbre and environmental acoustics. TES-VC processes simultaneous text…

Sound · Computer Science 2025-06-16 Jiawei Jin, Zhihan Yang, Yixuan Zhou, Zhiyong Wu

Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

The imitation of voice, targeted on specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data, and struggle with effectively disentangling timbre…

Sound · Computer Science 2025-03-30 Xueyao Zhang, Xiaohui Zhang, Kainan Peng, Zhenyu Tang, Vimal Manohar, Yingru Liu, Jeff Hwang, Dangna Li, Yuhao Wang, Julian Chan, Yuan Huang, Zhizheng Wu, Mingbo Ma