Related papers: An Explainable Proxy Model for Multilabel Audio Seg…

Explainable by-design Audio Segmentation through Non-Negative Matrix Factorization and Probing

Audio segmentation is a key task for many speech technologies, most of which rely on high-performing neural networks that are usually regarded as black boxes. However, in many domains, such as health or forensics, there is…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-21 Martin Lebourdais, Théo Mariotte, Antonio Almudévar, Marie Tahon, Alfonso Ortega

A Framework for Evaluating Faithfulness in Explainable AI for Machine Anomalous Sound Detection Using Frequency-Band Perturbation

Explainable AI (XAI) is commonly applied to anomalous sound detection (ASD) models to identify which time-frequency regions of an audio signal contribute to an anomaly decision. However, most audio explanations rely on qualitative…

Sound · Computer Science 2026-01-28 Alexander Buck, Georgina Cosma, Iain Phillips, Paul Conway, Patrick Baker

Cross-domain Voice Activity Detection with Self-Supervised Representations

Voice Activity Detection (VAD) aims at detecting speech segments in an audio signal, which is a necessary first step for many of today's speech-based applications. Current state-of-the-art methods focus on training a neural network exploiting…

Audio and Speech Processing · Electrical Eng. & Systems 2022-09-23 Sina Alisamir, Fabien Ringeval, Francois Portet

AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark

Explainable Artificial Intelligence (XAI) is targeted at understanding how models perform feature selection and derive their classification decisions. This paper explores post-hoc explanations for deep neural networks in the audio domain.…

Sound · Computer Science 2023-11-28 Sören Becker, Johanna Vielhaben, Marcel Ackermann, Klaus-Robert Müller, Sebastian Lapuschkin, Wojciech Samek

Quantitative Analysis of Proxy Tasks for Anomalous Sound Detection

Anomalous sound detection (ASD) typically involves self-supervised proxy tasks to learn feature representations from normal sound data, owing to the scarcity of anomalous samples. In ASD research, proxy tasks such as AutoEncoders operate…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-14 Seunghyeon Shin, Seokjin Lee

End-to-end speaker segmentation for overlap-aware resegmentation

Speaker segmentation consists in partitioning a conversation between one or more speakers into speaker turns. Usually addressed as the late combination of three sub-tasks (voice activity detection, speaker change detection, and overlapped…

Audio and Speech Processing · Electrical Eng. & Systems 2021-06-11 Hervé Bredin, Antoine Laurent

Explainability of CNN Based Classification Models for Acoustic Signal

Explainable Artificial Intelligence (XAI) has emerged as a critical tool for interpreting the predictions of complex deep learning models. While XAI has been increasingly applied in various domains within acoustics, its use in bioacoustics,…

Sound · Computer Science 2025-09-11 Zubair Faruqui, Mackenzie S. McIntire, Rahul Dubey, Jay McEntee

Symbolic Audio Classification via Modal Decision Tree Learning

The range of potential applications of acoustic analysis is wide. Classification of sounds, in particular, is a typical machine learning task that received a lot of attention in recent years. The most common approaches to sound…

Sound · Computer Science 2025-03-24 Enrico Marzano, Giovanni Pagliarini, Riccardo Pasini, Guido Sciavicco, Ionel Eduard Stan

Joint speech and overlap detection: a benchmark over multiple audio setup and speech domains

Voice activity and overlapped speech detection (respectively VAD and OSD) are key pre-processing tasks for speaker diarization. The final segmentation performance highly relies on the robustness of these sub-tasks. Recent studies have shown…

Sound · Computer Science 2023-07-26 Martin Lebourdais, Théo Mariotte, Marie Tahon, Anthony Larcher, Antoine Laurent, Silvio Montresor, Sylvain Meignier, Jean-Hugh Thomas

Biomimetic Frontend for Differentiable Audio Processing

While models in audio and speech processing are becoming deeper and more end-to-end, they as a consequence need expensive training on large data, and are often brittle. We build on a classical model of human hearing and make it…

Sound · Computer Science 2024-09-16 Ruolan Leslie Famularo, Dmitry N. Zotkin, Shihab A. Shamma, Ramani Duraiswami

Trainable Noise Model as an XAI evaluation method: application on Sobol for remote sensing image segmentation

eXplainable Artificial Intelligence (XAI) has emerged as an essential requirement when dealing with mission-critical applications, ensuring transparency and interpretability of the employed black box AI models. The significance of XAI spans…

Computer Vision and Pattern Recognition · Computer Science 2023-11-28 Hossein Shreim, Abdul Karim Gizzini, Ali J. Ghandour

Towards Robust Overlapping Speech Detection: A Speaker-Aware Progressive Approach Using WavLM

Overlapping Speech Detection (OSD) aims to identify regions where multiple speakers overlap in a conversation, a critical challenge in multi-party speech processing. This work proposes a speaker-aware progressive OSD model that leverages a…

Sound · Computer Science 2025-05-30 Zhaokai Sun, Li Zhang, Qing Wang, Pan Zhou, Lei Xie

Unsupervised Learning of Deep Features for Music Segmentation

Music segmentation refers to the dual problem of identifying boundaries between, and labeling, distinct music segments, e.g., the chorus, verse, bridge etc. in popular music. The performance of a range of music segmentation algorithms has…

Sound · Computer Science 2021-08-31 Matthew C. McCallum

Efficient Spoken Language Recognition via Multilabel Classification

Spoken language recognition (SLR) is the task of automatically identifying the language present in a speech signal. Existing SLR models are either too computationally expensive or too large to run effectively on devices with limited…

Computation and Language · Computer Science 2023-06-06 Oriol Nieto, Zeyu Jin, Franck Dernoncourt, Justin Salamon

Speaker Embedding-aware Neural Diarization: an Efficient Framework for Overlapping Speech Diarization in Meeting Scenarios

Overlapping speech diarization has been traditionally treated as a multi-label classification problem. In this paper, we reformulate this task as a single-label prediction problem by encoding multiple binary labels into a single label with…

Sound · Computer Science 2022-04-01 Zhihao Du, Shiliang Zhang, Siqi Zheng, Zhijie Yan

Explaining Speech Classification Models via Word-Level Audio Segments and Paralinguistic Features

Recent advances in eXplainable AI (XAI) have provided new insights into how models for vision, language, and tabular data operate. However, few approaches exist for understanding speech models. Existing work focuses on a few spoken language…

Computation and Language · Computer Science 2023-09-15 Eliana Pastor, Alkis Koudounas, Giuseppe Attanasio, Dirk Hovy, Elena Baralis

SAM Audio: Segment Anything in Audio

General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed…

Audio and Speech Processing · Electrical Eng. & Systems 2025-12-24 Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee

Modulation Discovery with Differentiable Digital Signal Processing

Modulations are a critical part of sound design and music production, enabling the creation of complex and evolving audio. Modern synthesizers provide envelopes, low frequency oscillators (LFOs), and more parameter automation tools that…

Sound · Computer Science 2025-10-08 Christopher Mitcheltree, Hao Hao Tan, Joshua D. Reiss

Reasoning Beyond Majority Vote: An Explainable SpeechLM Framework for Speech Emotion Recognition

Speech Emotion Recognition (SER) is typically trained and evaluated on majority-voted labels, which simplifies benchmarking but masks subjectivity and provides little transparency into why predictions are made. This neglects valid minority…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-06 Bo-Hao Su, Hui-Ying Shih, Jinchuan Tian, Jiatong Shi, Chi-Chun Lee, Carlos Busso, Shinji Watanabe

Sound Explanation for Trustworthy Machine Learning

We take a formal approach to the explainability problem of machine learning systems. We argue against the practice of interpreting black-box models via attributing scores to input components due to inherently conflicting goals of…

Machine Learning · Computer Science 2023-06-13 Kai Jia, Pasapol Saowakon, Limor Appelbaum, Martin Rinard