Related papers: Crossmodal Voice Conversion

200 papers

Image-to-image translation and voice conversion enable the generation of a new facial image and voice while maintaining some of the semantics such as a pose in an image and linguistic content in audio, respectively. They can aid in the…

Computer Vision and Pattern Recognition · Computer Science 2023-03-02 Naoya Takahashi, Mayank K. Singh, Yuki Mitsufuji

Multiple studies in the past have shown that there is a strong correlation between human vocal characteristics and facial features. However, existing approaches generate faces simply from voice, without exploring the set of features that…

Computer Vision and Pattern Recognition · Computer Science 2021-07-19 Hao Liang, Lulan Yu, Guikang Xu, Bhiksha Raj, Rita Singh

Speaker identity is one of the important characteristics of human speech. In voice conversion, we change the speaker identity from one to another, while keeping the linguistic content unchanged. Voice conversion involves multiple speech…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-18 Berrak Sisman, Junichi Yamagishi, Simon King, Haizhou Li

This paper presents a new voice conversion model capable of transforming both speaking and singing voices. It addresses key challenges in current systems, such as conveying emotions, managing pronunciation and accent changes, and…

Sound · Computer Science 2024-12-12 Sowmya Cheripally

We present a method for generating a video of a talking face. The method takes as inputs: (i) still images of the target face, and (ii) an audio speech segment; and outputs a video of the target face lip synched with the audio. The method…

Computer Vision and Pattern Recognition · Computer Science 2017-07-19 Joon Son Chung, Amir Jamaludin, Andrew Zisserman

Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. The prior studies on emotional voice conversion are mostly carried out under the…

Sound · Computer Science 2020-10-14 Kun Zhou, Berrak Sisman, Mingyang Zhang, Haizhou Li

Voice conversion is a task of synthesizing an utterance with target speaker's voice while maintaining linguistic information of the source utterance. While a speaker can produce varying utterances from a single script with different…

Sound · Computer Science 2025-04-17 Soobin Suh, Dabi Ahn, Heewoong Park, Jonghun Park

In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary…

Computer Vision and Pattern Recognition · Computer Science 2023-03-16 Se-Yun Um, Jihyun Kim, Jihyun Lee, Hong-Goo Kang

Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) suffering from obtaining facial embeddings that are well-aligned with the…

Sound · Computer Science 2024-09-05 Yan Rong, Li Liu

This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion…

Audio and Speech Processing · Electrical Eng. & Systems 2020-01-14 Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

The goal of voice conversion is to transform the speech of a source speaker to sound like that of a reference speaker while preserving the original content. A key challenge is to extract disentangled linguistic content from the source and…

Sound · Computer Science 2025-01-15 Jaehun Kim, Ji-Hoon Kim, Yeunju Choi, Tan Dat Nguyen, Seongkyu Mun, Joon Son Chung

The ideal goal of voice conversion is to convert the source speaker's speech to sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches are…

Audio and Speech Processing · Electrical Eng. & Systems 2022-06-16 Qicong Xie, Shan Yang, Yi Lei, Lei Xie, Dan Su

Cross-modal associations between voice and face from a person can be learnt algorithmically, which can benefit a lot of applications. The problem can be defined as voice-face matching and retrieval tasks. Much research attention has been…

Computer Vision and Pattern Recognition · Computer Science 2020-01-01 Chuyuan Xiong, Deyuan Zhang, Tao Liu, Xiaoyong Du

Voice profiling aims at inferring various human parameters from their speech, e.g. gender, age, etc. In this paper, we address the challenge posed by a subtask of voice profiling - reconstructing someone's face from their voice. The task is…

Sound · Computer Science 2019-06-04 Yandong Wen, Rita Singh, Bhiksha Raj

Direct speech-to-image translation without text is an interesting and useful topic due to the potential applications in human-computer interaction, art creation, computer-aided design, etc. Not to mention that many languages have no writing…

Multimedia · Computer Science 2020-07-15 Jiguo Li, Xinfeng Zhang, Chuanmin Jia, Jizheng Xu, Li Zhang, Yue Wang, Siwei Ma, Wen Gao

This work seeks the possibility of generating the human face from voice solely based on the audio-visual data without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links the inference stage and…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-14 Hyeong-Seok Choi, Changdae Park, Kyogu Lee

This paper presents a novel framework to build a voice conversion (VC) system by learning from a text-to-speech (TTS) synthesis system, that is called TTS-VC transfer learning. We first develop a multi-speaker speech synthesis system with…

Audio and Speech Processing · Electrical Eng. & Systems 2021-01-07 Mingyang Zhang, Yi Zhou, Li Zhao, Haizhou Li

This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS) where the voice can be generated from face image, and the characteristics of output speech (e.g., pace, noise level, distance, tone, place) can be controllable with…

Audio and Speech Processing · Electrical Eng. & Systems 2025-05-27 Minsu Kim, Pingchuan Ma, Honglie Chen, Stavros Petridis, Maja Pantic

Generating speech from a face image is crucial for developing virtual humans capable of interacting using their unique voices, without relying on pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a zero-shot…

Computer Vision and Pattern Recognition · Computer Science 2024-12-31 Minki Kang, Wooseok Han, Eunho Yang

Neural networks have recently become good at engaging in dialog. However, current approaches are based solely on verbal text, lacking the richness of a real face-to-face conversation. We propose a neural conversation model that aims to read…

Computer Vision and Pattern Recognition · Computer Science 2018-12-05 Hang Chu, Daiqing Li, Sanja Fidler