Related papers: Pretraining Techniques for Sequence-to-Sequence Vo…

Voice Transformer Network: Sequence-to-Sequence Voice Conversion Using Transformer with Text-to-Speech Pretraining

We introduce a novel sequence-to-sequence (seq2seq) voice conversion (VC) model based on the Transformer architecture with text-to-speech (TTS) pretraining. Seq2seq VC models are attractive owing to their ability to convert prosody. While…

Audio and Speech Processing · Electrical Eng. & Systems 2019-12-17 Wen-Chin Huang, Tomoki Hayashi, Yi-Chiao Wu, Hirokazu Kameoka, Tomoki Toda

The Sequence-to-Sequence Baseline for the Voice Conversion Challenge 2020: Cascading ASR and TTS

This paper presents the sequence-to-sequence (seq2seq) baseline system for the voice conversion challenge (VCC) 2020. We consider a naive approach for voice conversion (VC), which is to first transcribe the input speech with an automatic…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-07 Wen-Chin Huang, Tomoki Hayashi, Shinji Watanabe, Tomoki Toda

AttS2S-VC: Sequence-to-Sequence Voice Conversion with Attention and Context Preservation Mechanisms

This paper describes a method based on a sequence-to-sequence learning (Seq2Seq) with attention and context preservation mechanism for voice conversion (VC) tasks. Seq2Seq has been outstanding at numerous tasks involving sequence modeling…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-13 Kou Tanaka, Hirokazu Kameoka, Takuhiro Kaneko, Nobukatsu Hojo

Non-autoregressive sequence-to-sequence voice conversion

This paper proposes a novel voice conversion (VC) method based on non-autoregressive sequence-to-sequence (NAR-S2S) models. Inspired by the great success of NAR-S2S models such as FastSpeech in text-to-speech (TTS), we extend the…

Sound · Computer Science 2021-04-15 Tomoki Hayashi, Wen-Chin Huang, Kazuhiro Kobayashi, Tomoki Toda

On Prosody Modeling for ASR+TTS based Voice Conversion

In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic…

Sound · Computer Science 2021-07-21 Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

A Comparative Study on Transformer vs RNN in Speech Applications

Sequence-to-sequence models have been widely used in end-to-end speech processing, for example, automatic speech recognition (ASR), speech translation (ST), and text-to-speech (TTS). This paper focuses on an emergent sequence-to-sequence…

Computation and Language · Computer Science 2021-06-10 Shigeki Karita, Nanxin Chen, Tomoki Hayashi, Takaaki Hori, Hirofumi Inaguma, Ziyan Jiang, Masao Someki, Nelson Enrique Yalta Soplin, Ryuichi Yamamoto, Xiaofei Wang, Shinji Watanabe, Takenori Yoshimura, Wangyou Zhang

Many-to-Many Voice Transformer Network

This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework, which enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech. We previously…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-10 Hirokazu Kameoka, Wen-Chin Huang, Kou Tanaka, Takuhiro Kaneko, Nobukatsu Hojo, Tomoki Toda

ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion

This paper proposes a voice conversion (VC) method using sequence-to-sequence (seq2seq or S2S) learning, which flexibly converts not only the voice characteristics but also the pitch contour and duration of input speech. The proposed…

Sound · Computer Science 2020-10-08 Hirokazu Kameoka, Kou Tanaka, Damian Kwasny, Takuhiro Kaneko, Nobukatsu Hojo

Any-to-One Sequence-to-Sequence Voice Conversion using Self-Supervised Discrete Speech Representations

We present a novel approach to any-to-one (A2O) voice conversion (VC) in a sequence-to-sequence (seq2seq) framework. A2O VC aims to convert any speaker, including those unseen during training, to a fixed target speaker. We utilize…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-26 Wen-Chin Huang, Yi-Chiao Wu, Tomoki Hayashi, Tomoki Toda

Emotional Voice Conversion using Multitask Learning with Text-to-speech

Voice conversion (VC) is a task to transform a person's voice to a different style while conserving linguistic contents. Previous state-of-the-art on VC is based on sequence-to-sequence (seq2seq) model, which could mislead linguistic…

Audio and Speech Processing · Electrical Eng. & Systems 2019-11-28 Tae-Ho Kim, Sungjae Cho, Shinkook Choi, Sejik Park, Soo-Young Lee

Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities

Voice conversion (VC) using sequence-to-sequence learning of context posterior probabilities is proposed. Conventional VC using shared context posterior probabilities predicts target speech parameters from the context posterior…

Sound · Computer Science 2017-08-08 Hiroyuki Miyoshi, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari

Hierarchical Sequence to Sequence Voice Conversion with Limited Data

We present a voice conversion solution using recurrent sequence to sequence modeling for DNNs. Our solution takes advantage of recent advances in attention based modeling in the fields of Neural Machine Translation (NMT), Text-to-Speech…

Audio and Speech Processing · Electrical Eng. & Systems 2019-07-19 Praveen Narayanan, Punarjay Chakravarty, Francois Charette, Gint Puskorius

Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers

End-to-end speech recognition is a promising technology for enabling compact automatic speech recognition (ASR) systems since it can unify the acoustic and language model into a single neural network. However, as a drawback, training of…

Computation and Language · Computer Science 2022-02-17 Yotaro Kubo, Shigeki Karita, Michiel Bacchiani

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

In this paper, a neural network named Sequence-to-sequence ConvErsion NeTwork (SCENT) is presented for acoustic modeling in voice conversion. At training stage, a SCENT model is estimated by aligning the feature sequences of source and…

Sound · Computer Science 2020-01-14 Jing-Xuan Zhang, Zhen-Hua Ling, Li-Juan Liu, Yuan Jiang, Li-Rong Dai

An Adapter Based Pre-Training for Efficient and Scalable Self-Supervised Speech Representation Learning

We present a method for transferring pre-trained self-supervised (SSL) speech representations to multiple languages. There is an abundance of unannotated speech, so creating self-supervised representations from raw audio and fine-tuning on…

Audio and Speech Processing · Electrical Eng. & Systems 2022-02-08 Samuel Kessler, Bethan Thomas, Salah Karout

Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages

Recently, sequence-to-sequence models with attention have been successfully applied in Text-to-speech (TTS). These models can generate near-human speech with a large accurately-transcribed speech corpus. However, preparing such a large…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-12 Haitong Zhang, Yue Lin

Context-Aware Sequence-to-Sequence Models for Conversational Systems

This work proposes a novel approach based on sequence-to-sequence (seq2seq) models for context-aware conversational systems. Existing seq2seq models have been shown to be good for generating natural responses in a data-driven…

Computation and Language · Computer Science 2018-05-23 Silje Christensen, Simen Johnsrud, Massimiliano Ruocco, Heri Ramampiaro

Non-Parallel Sequence-to-Sequence Voice Conversion with Disentangled Linguistic and Speaker Representations

This paper presents a method of sequence-to-sequence (seq2seq) voice conversion using non-parallel training data. In this method, disentangled linguistic and speaker representations are extracted from acoustic features, and voice conversion…

Audio and Speech Processing · Electrical Eng. & Systems 2020-01-14 Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages

We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech…

Computation and Language · Computer Science 2022-05-03 Felix Wu, Kwangyoun Kim, Shinji Watanabe, Kyu Han, Ryan McDonald, Kilian Q. Weinberger, Yoav Artzi

Improving speech recognition models with small samples for air traffic control systems

In the domain of air traffic control (ATC) systems, efforts to train a practical automatic speech recognition (ASR) model always face the problem of small training samples since the collection and annotation of speech samples are expert-…

Sound · Computer Science 2021-02-17 Yi Lin, Qin Li, Bo Yang, Zhen Yan, Huachun Tan, Zhengmao Chen