Related papers: A Deep-Bayesian Framework for Adaptive Speech Dura…

Bayesian adaptive learning to latent variables via Variational Bayes and Maximum a Posteriori

In this work, we aim to establish a Bayesian adaptive learning framework by focusing on estimating latent variables in deep neural network (DNN) models. Latent variables indeed encode both transferable distributional information and…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-26 Hu Hu, Sabato Marco Siniscalchi, Chin-Hui Lee

Attention and Encoder-Decoder based models for transforming articulatory movements at different speaking rates

While speaking at different rates, articulators (like tongue, lips) tend to move differently and the enunciations are also of different durations. In the past, affine transformation and DNN have been used to transform articulatory movements…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-21 Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh

Variational Bayesian Adaptive Learning of Deep Latent Variables for Acoustic Knowledge Transfer

In this work, we propose a novel variational Bayesian adaptive learning approach for cross-domain knowledge transfer to address acoustic mismatches between training and testing conditions, such as recording devices and environmental noise.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-28 Hu Hu, Sabato Marco Siniscalchi, Chao-Han Huck Yang, Chin-Hui Lee

Bayesian Learning for Deep Neural Network Adaptation

A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences. Speaker adaptation techniques play a vital role to reduce the mismatch. Model-based…

Sound · Computer Science 2024-06-17 Xurong Xie, Xunying Liu, Tan Lee, Lan Wang

Real-time speech enhancement with dynamic attention span

For real-time speech enhancement (SE) including noise suppression, dereverberation and acoustic echo cancellation, the time-variance of the audio signals becomes a severe challenge. The causality and memory usage limit that only the…

Audio and Speech Processing · Electrical Eng. & Systems 2023-02-22 Chengyu Zheng, Yuan Zhou, Xiulian Peng, Yuan Zhang, Yan Lu

Forward Attention in Sequence-to-sequence Acoustic Modelling for Speech Synthesis

This paper proposes a forward attention method for the sequence-to-sequence acoustic modeling of speech synthesis. This method is motivated by the nature of the monotonic alignment from phone sequences to acoustic sequences. Only the…

Computation and Language · Computer Science 2020-01-14 Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai

DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration

Long-context understanding is crucial for many NLP applications, yet transformers struggle with efficiency due to the quadratic complexity of self-attention. Sparse attention methods alleviate this cost but often impose static, predefined…

Computation and Language · Computer Science 2025-06-16 Hanzhi Zhang, Heng Fan, Kewei Sha, Yan Huang, Yunhe Feng

Variable Attention Masking for Configurable Transformer Transducer Speech Recognition

This work studies the use of attention masking in transformer transducer based speech recognition for building a single configurable model for different deployment scenarios. We present a comprehensive set of experiments comparing fixed…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-19 Pawel Swietojanski, Stefan Braun, Dogan Can, Thiago Fraga da Silva, Arnab Ghoshal, Takaaki Hori, Roger Hsiao, Henry Mason, Erik McDermott, Honza Silovsky, Ruchir Travadi, Xiaodan Zhuang

Adaptively Aligned Image Captioning via Adaptive Attention Time

Recent neural models for image captioning usually employ an encoder-decoder framework with an attention mechanism. However, the attention mechanism in such a framework aligns one single (attended) image feature vector to one caption word,…

Computer Vision and Pattern Recognition · Computer Science 2020-01-07 Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen

Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems

Rich sources of variability in natural speech present significant challenges to current data intensive speech recognition technologies. To model both speaker and environment level diversity, this paper proposes a novel Bayesian factorised…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-27 Jiajun Deng, Guinan Li, Xurong Xie, Zengrui Jin, Mingyu Cui, Tianzi Wang, Shujie Hu, Mengzhe Geng, Xunying Liu

End-to-end Speech Recognition with Adaptive Computation Steps

In this paper, we present Adaptive Computation Steps (ACS) algorithm, which enables end-to-end speech recognition models to dynamically decide how many frames should be processed to predict a linguistic output. The model that applies ACS…

Audio and Speech Processing · Electrical Eng. & Systems 2018-09-27 Mohan Li, Min Liu, Masanori Hattori

Adaptive Duration Model for Text Speech Alignment

Speech-to-text alignment is a critical component of neural text to speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line, while non-autoregressive end to end TTS models rely on…

Sound · Computer Science 2025-09-01 Junjie Cao

End-to-End Text-to-Speech using Latent Duration based on VQ-VAE

Explicit duration modeling is a key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework using explicit duration modeling that incorporates duration as a discrete latent variable to…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-21 Yusuke Yasuda, Xin Wang, Junichi Yamagishi

RLS-Based Adaptive Dereverberation Tracing Abrupt Position Change of Target Speaker

Adaptive algorithm based on multi-channel linear prediction is an effective dereverberation method balancing well between the attenuation of the long-term reverberation and the dereverberated speech quality. However, the abrupt change of…

Audio and Speech Processing · Electrical Eng. & Systems 2018-08-24 Teng Xiang, Jing Lu, Kai Chen

Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation

Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-09 Tobias Raichle, Niels Edinger, Bin Yang

Attention Based Fully Convolutional Network for Speech Emotion Recognition

Speech emotion recognition is a challenging task for three main reasons: 1) human emotion is abstract, which means it is hard to distinguish; 2) in general, human emotion can only be detected in some specific moments during a long…

Sound · Computer Science 2019-05-03 Yuanyuan Zhang, Jun Du, Zirui Wang, Jianshu Zhang

Self-Attention Linguistic-Acoustic Decoder

The conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models like recurrent neural networks. Despite the good performance of…

Sound · Computer Science 2018-11-07 Santiago Pascual, Antonio Bonafonte, Joan Serrà

Binaural Speech Enhancement Using Deep Complex Convolutional Transformer Networks

Studies have shown that in noisy acoustic environments, providing binaural signals to the user of an assistive listening device may improve speech intelligibility and spatial awareness. This paper presents a binaural speech enhancement…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-11 Vikas Tokala, Eric Grinstein, Mike Brookes, Simon Doclo, Jesper Jensen, Patrick A. Naylor

Attentive Convolutional Neural Network based Speech Emotion Recognition: A Study on the Impact of Input Features, Signal Length, and Acted Speech

Speech emotion recognition is an important and challenging task in the realm of human-computer interaction. Prior work proposed a variety of models and feature sets for training a system. In this work, we conduct extensive experiments using…

Computation and Language · Computer Science 2017-06-05 Michael Neumann, Ngoc Thang Vu

AADNet: An End-to-End Deep Learning Model for Auditory Attention Decoding

Auditory attention decoding (AAD) is the process of identifying the attended speech in a multi-talker environment using brain signals, typically recorded through electroencephalography (EEG). Over the past decade, AAD has undergone…

Sound · Computer Science 2025-07-08 Nhan Duc Thanh Nguyen, Huy Phan, Simon Geirnaert, Kaare Mikkelsen, Preben Kidmose