Related papers: Enhanced Voice Post Processing Using Voice Decoder…

Learning to Guide Decoding for Image Captioning

Recently, much advance has been made in image captioning, and an encoder-decoder framework has achieved outstanding performance for this task. In this paper, we propose an extension of the encoder-decoder framework by adding a component…

Computer Vision and Pattern Recognition · Computer Science 2018-04-04 Wenhao Jiang, Lin Ma, Xinpeng Chen, Hanwang Zhang, Wei Liu

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features…

Audio and Speech Processing · Electrical Eng. & Systems 2021-08-06 Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

Neural Vocoders as Speech Enhancers

Speech enhancement (SE) and neural vocoding are traditionally viewed as separate tasks. In this work, we observe them under a common thread: the rank behavior of these processes. This observation prompts two key questions: Can a…

Sound · Computer Science 2025-01-24 Andong Li, Zhihang Sun, Fengyuan Hao, Xiaodong Li, Chengshi Zheng

Enhancing Listened Speech Decoding from EEG via Parallel Phoneme Sequence Prediction

Brain-computer interfaces (BCI) offer numerous human-centered application possibilities, particularly affecting people with neurological disorders. Text or speech decoding from brain activities is a relevant domain that could augment the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-10 Jihwan Lee, Tiantian Feng, Aditya Kommineni, Sudarsana Reddy Kadiri, Shrikanth Narayanan

Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data

We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems. It enhances the ability of adapting one modality (i.e., source-language speech) to another (i.e., source-language text). Thus, the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-06 Yuhao Zhang, Chen Xu, Bojie Hu, Chunliang Zhang, Tong Xiao, Jingbo Zhu

Degrading Voice: A Comprehensive Overview of Robust Voice Conversion Through Input Manipulation

Identity, accent, style, and emotions are essential components of human speech. Voice conversion (VC) techniques process the speech signals of two input speakers and other modalities of auxiliary information such as prompts and emotion…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-09 Xining Song, Zhihua Wei, Rui Wang, Haixiao Hu, Yanxiang Chen, Meng Han

A Unified Model For Voice and Accent Conversion In Speech and Singing using Self-Supervised Learning and Feature Extraction

This paper presents a new voice conversion model capable of transforming both speaking and singing voices. It addresses key challenges in current systems, such as conveying emotions, managing pronunciation and accent changes, and…

Sound · Computer Science 2024-12-12 Sowmya Cheripally

Speaker Re-identification with Speaker Dependent Speech Enhancement

While the use of deep neural networks has significantly boosted speaker recognition performance, it is still challenging to separate speakers in poor acoustic environments. Here speech enhancement methods have traditionally allowed improved…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-28 Yanpei Shi, Qiang Huang, Thomas Hain

Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model

High-quality speech corpora are essential foundations for most speech applications. However, such speech data are expensive and limited since they are collected in professional recording environments. In this work, we propose an…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-11 Haoyu Li, Yang Ai, Junichi Yamagishi

Generic Speech Enhancement with Self-Supervised Representation Space Loss

Single-channel speech enhancement is utilized in various tasks to mitigate the effect of interfering signals. Conventionally, to ensure the speech enhancement performs optimally, the speech enhancement has needed to be tuned for each task.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-11 Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Takafumi Moriya, Takanori Ashihara, Ryo Masumura

AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization

Recently, voice conversion (VC) has been widely studied. Many VC systems use disentangle-based learning techniques to separate the speaker and the linguistic content information from a speech signal. Subsequently, they convert the voice by…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-03 Yen-Hao Chen, Da-Yi Wu, Tsung-Han Wu, Hung-yi Lee

Revisiting Singing Voice Detection: a Quantitative Review and the Future Outlook

Since the vocal component plays a crucial role in popular music, singing voice detection has been an active research topic in music information retrieval. Although several proposed algorithms have shown high performances, we argue that…

Sound · Computer Science 2018-06-05 Kyungyun Lee, Keunwoo Choi, Juhan Nam

Efficient Speech Enhancement via Embeddings from Pre-trained Generative Audioencoders

Recent research has delved into speech enhancement (SE) approaches that leverage audio embeddings from pre-trained models, diverging from time-frequency masking or signal prediction techniques. This paper introduces an efficient and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-16 Xingwei Sun, Heinrich Dinkel, Yadong Niu, Linzhang Wang, Junbo Zhang, Jian Luan

Improving Visual Speech Enhancement Network by Learning Audio-visual Affinity with Multi-head Attention

Audio-visual speech enhancement system is regarded as one of promising solutions for isolating and enhancing speech of desired speaker. Typical methods focus on predicting clean speech spectrum via a naive convolution neural network based…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-01 Xinmeng Xu, Yang Wang, Jie Jia, Binbin Chen, Dejun Li

Improving Accent Conversion with Reference Encoder and End-To-End Text-To-Speech

Accent conversion (AC) transforms a non-native speaker's accent into a native accent while maintaining the speaker's voice timbre. In this paper, we propose approaches to improving accent conversion applicability, as well as quality. First…

Computation and Language · Computer Science 2020-05-20 Wenjie Li, Benlai Tang, Xiang Yin, Yushi Zhao, Wei Li, Kang Wang, Hao Huang, Yuxuan Wang, Zejun Ma

Audiovisual Singing Voice Separation

Separating a song into vocal and accompaniment components is an active research topic, and recent years witnessed an increased performance from supervised training using deep learning techniques. We propose to apply the visual information…

Sound · Computer Science 2021-07-02 Bochen Li, Yuxuan Wang, Zhiyao Duan

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-16 Daniel Michelsanti, Zheng-Hua Tan, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu, Jesper Jensen

Exploring Aligned Lyrics-Informed Singing Voice Separation

In this paper, we propose a method of utilizing aligned lyrics as additional information to improve the performance of singing voice separation. We have combined the highway network-based lyrics encoder into Open-unmix separation network…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-12 Chang-Bin Jeon, Hyeong-Seok Choi, Kyogu Lee

Interactive decoding of words from visual speech recognition models

This work describes an interactive decoding method to improve the performance of visual speech recognition systems using user input to compensate for the inherent ambiguity of the task. Unlike most phoneme-to-word decoding pipelines, which…

Computation and Language · Computer Science 2021-07-05 Brendan Shillingford, Yannis Assael, Misha Denil

Towards General-Purpose Text-Instruction-Guided Voice Conversion

This paper introduces a novel voice conversion (VC) model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice". Unlike traditional methods that rely on reference utterances to…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-17 Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung, Kai-Wei Chang, Shuo-yiin Chang, Hung-yi Lee