Electrical Eng. & Systems

Frequency-Modulated and Single-Tone Excitation to Reveal Vibro-Acoustic Nonlinearities in Loosened Bolted Joints

Preload loss in bolted joints alters the stiffness, damping, and nonlinearity of the structure, but existing monitoring techniques for rail-vehicle systems often cannot combine controlled shaker tests and…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-29 · Berkay Kullukcu, Robin Pianowski, Dina Hannebauer

Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions

AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG)…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-29 · Heejoon Koo, Yoon Tae Kim, Miika Toikkanen, June-Woo Kim

MELD: Mel-Spectrogram-Based Speech Language Modeling with Discrete Latent Variables

Recent speech language models rely on encoders that are optimized separately from autoregressive models. Since these encoders are unaware of the downstream objectives, the extracted representations may not be optimal for downstream tasks.…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-29 · Sung-Lin Yeh, Wei Zhou, Gil Keren, Duc Le, Zhong Meng, Hao Tang, Jay Mahadeokar, Ozlem Kalinli, Alexandre Mourachko

Decoding Strategies for Diffusion-Based ASR: A Systematic Evaluation of Confidence-Based Thresholding

While LLM-based Automatic Speech Recognition (ASR) achieves high accuracy, its speed is limited by sequential autoregressive decoding. Diffusion Language Models (DLMs) offer a parallel alternative, yet their decoding strategies remain…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-29 · Jeong Hun Yeo, Minsu Kim, Hyeongseop Rha, Yong Man Ro

The WER Trap: Shattering the Illusion of Unified Tokens in Speech Language Models

The pursuit of a "unified" discrete token for both speech understanding and generation has led the Speech Language Model (SLM) community to heavily rely on Word Error Rate (WER) -- the core metric for Whisper-style tokenizers -- as the…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-29 · Xiangyu Zhang, Yuxin Li, Haoyang Zhang, Shiqi Han, Hexin Liu, Qiquan Zhang, Beena Ahmed, Julien Epps

Explainable AI in Speaker Recognition -- Making Latent Representations Understandable

Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls within the Explainable AI (XAI) domain. This paper proposes to study an XAI topic: uncovering the unknown…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-29 · Yanze Xu, Wenwu Wang, Mark D. Plumbley

FNH-TTS: Mixture-of-Experts Duration Modeling for Robust Neural Speech Synthesis

Current non-autoregressive (NAR) text-to-speech (TTS) systems still struggle to model diverse and speaker-dependent duration variation. We further observe that richer duration variation can increase the synthesis difficulty of existing…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-29 · Qingliang Meng, Yuqing Deng, Wei Liang, Limei Yu, Huizhi Liang, Tian Li

Comprehensive Benchmarking of Long-Form Speech Generation in Diverse Scenarios

Recent advances in speech generation have enabled high-fidelity synthesis, yet systematic evaluation of models under long-context conditions remains largely underexplored. A comprehensive evaluation benchmark for long-form speech is…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-28 · Changhao Pan, Rui Yang, Han Wang, Zhuan Zhou, Xuming He, Wenxiang Guo, Ziyue Jiang, Ruiqi Li, Yu Zhang, Chenyuhao Wen, Ke Lei, Xiang Yin, Jingyu Lu, Zhiyuan Zhu, Zhou Zhao

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-28 · Yucheng Wang, Jing Peng, Hanqi Li, Chenghao Wang, Wenming Tu, Yu Xi, Zhaokai Sun, Kai Yu, Shuai Wang

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

Automatic deepfake detection has received considerable research attention, yet the socio-technical environment in which humans actually encounter synthetic speech remains poorly understood. We investigate voice deepfake detection as a…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-28 · Lelia Erscoi, Tomi Kinnunen

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

Audio tokenizers are fundamental to unifying audio understanding and generation. Understanding requires high-level semantics, while generation demands semantic and acoustic details. Existing unified tokenizers jointly encode both in…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-28 · Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng, Guoyang Zeng, Zhiyong Wu

FSD50K-Solo: Automated Curation of Single-Source Sound Events

High-quality training datasets are essential for the performance of neural networks. However, the audio domain still lacks a large-scale, strongly labeled, single-source sound event dataset. The FSD50K dataset, despite being relatively…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-28 · Ningyuan Yang, Sile Yin, Li-Chia Yang, Bryce Irvin, Xiao Quan, Marko Stamenovic, Shuo Zhang

VAANI: Capturing the language landscape for an inclusive digital India

Voice-based technologies have the potential to bridge digital accessibility gaps; however, existing datasets fail to capture the linguistic and regional diversity of Indic languages. We present Project VAANI, a large-scale multimodal…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-28 · Sujith Pulikodan, Abhayjeet Singh, Agneedh Basu, Nihar Desai, Pavan Kumar J, Pranav D Bhat, Raghu Dharmaraju, Ritika Gupta, Sathvik Udupa, Saurabh Kumar, Sumit Sharma, Visruth Sanka, Dinesh Tewari, Harsh Dhand, Amrita Kamat, Sukhwinder Singh, Shikhar Vashishth, Partha Talukdar, Raj Acharya, Prasanta Kumar Ghosh

Why Can't They Remember? Uncovering Representation and Retrieval Bottlenecks in Multi-Turn Acoustic Memory

Large audio language models (LALMs) process both speech and environmental acoustic cues, yet struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech)…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-27 · Yang Xiao, Siyi Wang, Han Yin, Hong Jia, Vidhyasaharan Sethu, Eun-Jung Holden, Ting Dang

CFMDCTCodec: A Low-Bitrate Neural Speech Codec with Noise-Prior-aware Conditional Flow Matching for MDCT-Spectral Enhancement

High-quality speech coding at low bitrates is crucial for bandwidth-constrained applications, yet remains challenging due to the severe loss of quality-critical information in highly compressed representations. To overcome this challenge,…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-27 · Xiao-Hang Jiang, Yang Ai, Hui-Peng Du, Zhen-Hua Ling, Ji Wu

Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

Ultra-low-bitrate speech coding is pivotal for bandwidth-constrained communication and deep compression, yet maintaining naturalness and speaker identity at such extreme bit budgets remains challenging due to pronounced information loss and…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-26 · Hui-Peng Du, Yang Ai, Xiao-Hang Jiang, Yuan Tian, Zhen-Hua Ling

Decoding Stimulus Reconstruction-Based Auditory Attention Robustly in Unbalanced EEG Datasets

In the past decade, numerous studies have applied deep neural networks (DNNs) to auditory attention decoding (AAD) from electroencephalogram (EEG) signals via stimulus reconstruction. However, the influence of dataset balance on the decoding…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-26 · Yuanming Zhang, Yayun Liang, Zhibin Lin, Jing Lu

cSTMM: A Unified Complex Spherical Student's $t$ Mixture Model for Directional Statistics in Mask-Based Blind Speech Separation

Mask-based blind speech separation (BSS) estimates source-wise time-frequency (TF) masks by clustering multichannel observations using spatial information. The directional statistical approach clusters normalized multichannel observations…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-26 · Nobutaka Ito

WaveNeXt 2: ConvNeXt-Based Fast Neural Vocoders With Residual Denoising and Sub-Modeling for GAN and Diffusion Models

Most neural vocoders are limited to one type: either GAN-based or diffusion-based. While state-of-the-art models like Vocos and WaveNeXt use powerful ConvNeXt-based generators, they have only been used in GAN frameworks and have limited…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-26 · Wangzixi Zhou, Takuma Okamoto, Yamato Ohtani, Sakriani Sakti, Hisashi Kawai

Toward Natural Emotional Text-To-Speech System with Fine-Grained Non-Verbal Expression Control

While current emotional Text-to-Speech (TTS) models have successfully controlled verbal prosody, they often ignore non-verbal vocalizations (NVs), which are essential for authentic human emotion. Although some non-verbal datasets have…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-05-26 · Wangzixi Zhou, Bagus Tris Atmaja, Sakriani Sakti