Related papers: Listenable Maps for Audio Classifiers

LMAC-TD: Producing Time Domain Explanations for Audio Classifiers

Neural networks are typically black-boxes that remain opaque with regards to their decision mechanisms. Several works in the literature have proposed post-hoc explanation methods to alleviate this issue. This paper proposes LMAC-TD, a…

Sound · Computer Science 2024-09-16 Eleonora Mancini, Francesco Paissan, Mirco Ravanelli, Cem Subakan

Tackling Interpretability in Audio Classification Networks with Non-negative Matrix Factorization

This paper tackles two major problem settings for interpretability of audio processing networks, post-hoc and by-design interpretation. For post-hoc interpretation, we aim to interpret decisions of a network in terms of high-level audio…

Sound · Computer Science 2023-05-15 Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Gaël Richard, Florence d'Alché-Buc

LLark: A Multimodal Instruction-Following Language Model for Music

Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal…

Sound · Computer Science 2024-06-04 Josh Gardner, Simon Durand, Daniel Stoller, Rachel M. Bittner

Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF

This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel…

Sound · Computer Science 2022-10-25 Jayneel Parekh, Sanjeel Parekh, Pavlo Mozharovskyi, Florence d'Alché-Buc, Gaël Richard

Toward Faithful Explanations in Acoustic Anomaly Detection

Interpretability is essential for user trust in real-world anomaly detection applications. However, deep learning models, despite their strong performance, often lack transparency. In this work, we study the interpretability of…

Sound · Computer Science 2026-01-21 Maab Elrashid, Anthony Deschênes, Cem Subakan, Mirco Ravanelli, Rémi Georges, Michael Morin

Masked Autoencoders that Listen

This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio…

Sound · Computer Science 2023-01-13 Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer

Audio-Mind: An Auditable Agentic Framework for Audio Understanding

Audio agents extend large audio-language models (LALMs) by decomposing audio questions into tool calls, intermediate evidence, and iterative reasoning steps. However, as LALMs become stronger, the key challenge shifts from enabling tool use…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-28 Yucheng Wang, Jing Peng, Hanqi Li, Chenghao Wang, Wenming Tu, Yu Xi, Zhaokai Sun, Kai Yu, Shuai Wang

A Test Statistic Estimation-based Approach for Establishing Self-interpretable CNN-based Binary Classifiers

Interpretability is highly desired for deep neural network-based classifiers, especially when addressing high-stake decisions in medical imaging. Commonly used post-hoc interpretability methods have the limitation that they can produce…

Image and Video Processing · Electrical Eng. & Systems 2024-01-04 Sourya Sengupta, Mark A. Anastasio

Transformation of audio embeddings into interpretable, concept-based representations

Advancements in audio neural networks have established state-of-the-art results on downstream audio tasks. However, the black-box structure of these models makes it difficult to interpret the information encoded in their internal audio…

Sound · Computer Science 2025-04-22 Alice Zhang, Edison Thomaz, Lie Lu

Fuse after Align: Improving Face-Voice Association Learning via Multimodal Encoder

Today, there have been many achievements in learning the association between voice and face. However, most previous work models rely on cosine similarity or L2 distance to evaluate the likeness of voices and faces following contrastive…

Computer Vision and Pattern Recognition · Computer Science 2024-04-16 Chong Peng, Liqiang He, Dan Su

Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality…

Computation and Language · Computer Science 2025-05-30 Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim

Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract…

Sound · Computer Science 2025-06-05 Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid

Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction

Explaining recommendations enables users to understand whether recommended items are relevant to their needs and has been shown to increase their trust in the system. More generally, if designing explainable machine learning models is key…

Machine Learning · Computer Science 2020-08-27 Darius Afchar, Romain Hennequin

Explainable Multi-Modal Deep Learning for Automatic Detection of Lung Diseases from Respiratory Audio Signals

Respiratory diseases remain major global health challenges, and traditional auscultation is often limited by subjectivity, environmental noise, and inter-clinician variability. This study presents an explainable multimodal deep learning…

Sound · Computer Science 2025-12-02 S M Asiful Islam Saky, Md Rashidul Islam, Md Saiful Arefin, Shahaba Alam

LEAF: A Learnable Frontend for Audio Classification

Mel-filterbanks are fixed, engineered audio features which emulate human perception and have been used through the history of audio understanding up to today. However, their undeniable qualities are counterbalanced by the fundamental…

Sound · Computer Science 2021-01-22 Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi

A Model You Can Hear: Audio Identification with Playable Prototypes

Machine learning techniques have proved useful for classifying and analyzing audio content. However, recent methods typically rely on abstract and high-dimensional representations that are difficult to interpret. Inspired by…

Sound · Computer Science 2022-08-08 Romain Loiseau, Baptiste Bouvier, Yann Teytaut, Elliot Vincent, Mathieu Aubry, Loic Landrieu

Attention Consistency for LLMs Explanation

Understanding the decision-making processes of large language models (LLMs) is essential for their trustworthy development and deployment. However, current interpretability methods often face challenges such as low resolution and high…

Computation and Language · Computer Science 2025-10-14 Tian Lan, Jinyuan Xu, Xue He, Jenq-Neng Hwang, Lei Li

Biomimetic Frontend for Differentiable Audio Processing

While models in audio and speech processing are becoming deeper and more end-to-end, they as a consequence need expensive training on large data, and are often brittle. We build on a classical model of human hearing and make it…

Sound · Computer Science 2024-09-16 Ruolan Leslie Famularo, Dmitry N. Zotkin, Shihab A. Shamma, Ramani Duraiswami

Focal Modulation Networks for Interpretable Sound Classification

The increasing success of deep neural networks has raised concerns about their inherent black-box nature, posing challenges related to interpretability and trust. While there has been extensive exploration of interpretation techniques in…

Sound · Computer Science 2024-02-07 Luca Della Libera, Cem Subakan, Mirco Ravanelli

VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking

The lack of interpretability of the Vision Transformer may hinder its use in critical real-world applications despite its effectiveness. To overcome this issue, we propose a post-hoc interpretability method called VISION DIFFMASK, which…

Computer Vision and Pattern Recognition · Computer Science 2023-04-14 Angelos Nalmpantis, Apostolos Panagiotopoulos, John Gkountouras, Konstantinos Papakostas, Wilker Aziz