Related papers: Crossmodal Voice Conversion

Cross-modal Face- and Voice-style Transfer

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively. They can aid in the…

Computer Vision and Pattern Recognition · Computer Science 2023-03-02 Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

Controlled AutoEncoders to Generate Faces from Voices

Multiple studies in the past have shown that there is a strong correlation between human vocal characteristics and facial features. However, existing approaches generate faces simply from voice, without exploring the set of features that…

Computer Vision and Pattern Recognition · Computer Science 2021-07-19 Hao Liang, Lulan Yu, Guikang Xu, Bhiksha Raj, Rita Singh

An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning

Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-18 Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li

A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction

This paper presents a new voice conversion model capable of transforming both speaking and singing voices. It addresses key challenges in current systems, such as conveying emotions, managing pronunciation and accent changes, and…

Sound · Computer Science 2024-12-12 Sowmya Cheripally

You said that?

We present a method for generating a video of a talking face. The method takes as inputs: (i) still images of the target face, and (ii) an audio speech segment; and outputs a video of the target face lip synched with the audio. The method…

Computer Vision and Pattern Recognition · Computer Science 2017-07-19 Joon Son Chung, Amir Jamaludin, Andrew Zisserman

Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion

Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. The prior studies on emotional voice conversion are mostly carried out under the…

Sound · Computer Science 2020-10-14 Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li

Voice Conversion with Diverse Intonation using Conditional Variational Auto-Encoder

Voice conversion is a task of synthesizing an utterance with target speaker's voice while maintaining linguistic information of the source utterance. While a speaker can produce varying utterances from a single script with different…

Sound · Computer Science 2025-04-17 Soobin Suh, Dabi Ahn, Heewoong Park, Jonghun Park

Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations

In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Se-Yun Um, Jihyun Kim, Jihyun Lee, Hong-Goo Kang

Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) suffering from obtaining facial embeddings that are well-aligned with the…

Sound · Computer Science 2024-09-05 Yan Rong, Li Liu

Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations

This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion…

Audio and Speech Processing · Electrical Eng. & Systems 2020-01-14 Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

AdaptVC: High Quality Voice Conversion with Adaptive Learning

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and…

Sound · Computer Science 2025-01-15 Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung

End-to-End Voice Conversion with Information Perturbation

The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-16 Qicong Xie, Shan Yang, Yi Lei, Lei Xie, Dan Su

Voice-Face Cross-modal Matching and Retrieval: A Benchmark

Cross-modal associations between voice and face from a person can be learnt algorithmically, which can benefit a lot of applications. The problem can be defined as voice-face matching and retrieval tasks. Much research attention has been…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Chuyuan Xiong, Deyuan Zhang, Tao Liu, Xiaoyong Du

Reconstructing faces from voices

Voice profiling aims at inferring various human parameters from their speech, e.g. gender, age, etc. In this paper, we address the challenge posed by a subtask of voice profiling - reconstructing someone's face from their voice. The task is…

Sound · Computer Science 2019-06-04 Yandong Wen, Rita Singh, Bhiksha Raj

Direct Speech-to-image Translation

Direct speech-to-image translation without text is an interesting and useful topic due to the potential applications in human-computer interaction, art creation, computer-aided design, etc. Not to mention that many languages have no writing…

Multimedia · Computer Science 2020-07-15 Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao

From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech

This work seeks the possibility of generating the human face from voice solely based on the audio-visual data without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links the inference stage and…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-14 Hyeong-Seok Choi, Changdae Park, Kyogu Lee

Transfer Learning from Speech Synthesis to Voice Conversion with Non-Parallel Training Data

This paper presents a novel framework to build a voice conversion (VC) system by learning from a text-to-speech (TTS) synthesis system, that is called TTS-VC transfer learning. We first develop a multi-speaker speech synthesis system with…

Audio and Speech Processing · Electrical Eng. & Systems 2021-01-07 Mingyang Zhang, Yi Zhou, Li Zhao, Haizhou Li

Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis

This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS) where the voice can be generated from face image, and the characteristics of output speech (e.g., pace, noise level, distance, tone, place) can be controllable with…

Audio and Speech Processing · Electrical Eng. & Systems 2025-05-27 Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic

Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping

Generating speech from a face image is crucial for developing virtual humans capable of interacting using their unique voices, without relying on pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a zero-shot…

Computer Vision and Pattern Recognition · Computer Science 2024-12-31 Minki Kang, Wooseok Han, Eunho Yang

A Face-to-Face Neural Conversation Model

Neural networks have recently become good at engaging in dialog. However, current approaches are based solely on verbal text, lacking the richness of a real face-to-face conversation. We propose a neural conversation model that aims to read…

Computer Vision and Pattern Recognition · Computer Science 2018-12-05 Hang Chu, Daiqing Li, Sanja Fidler