Related papers: ADIFF: Explaining audio difference using natural l…

Audio Difference Learning for Audio Captioning

This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-18 Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that…

Audio and Speech Processing · Electrical Eng. & Systems 2023-08-24 Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-04 Sandeep Kothinti, Dimitra Emmanouilidou

Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features

Several types of spoofed audio, such as mimicry, replay attacks, and deepfakes, have created societal challenges to information integrity. Recently, researchers have worked with sociolinguistics experts to label spoofed audio samples with…

Sound · Computer Science 2024-09-11 Zahra Khanjani, Tolulope Ale, Jianwu Wang, Lavon Davis, Christine Mallinson, Vandana P. Janeja

Caption Feature Space Regularization for Audio Captioning

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio, different people may perceive the same audio differently, resulting in caption disparities (i.e., one audio may correlate to…

Sound · Computer Science 2022-04-19 Yiming Zhang, Hong Yu, Ruoyi Du, Zhanyu Ma, Yuan Dong

DistinctAD: Distinctive Audio Description Generation in Contexts

Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the…

Computer Vision and Pattern Recognition · Computer Science 2024-11-28 Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan

DEAF: A Benchmark for Diagnostic Evaluation of Acoustic Faithfulness in Audio Language Models

Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To…

Artificial Intelligence · Computer Science 2026-03-23 Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Yutong Zhang, Ziteng Wang, Ruofan Liao, Weisheng Xu, Sichen Liu

Emotion and Acoustics Should Agree: Cross-Level Inconsistency Analysis for Audio Deepfake Detection

Audio Deepfake Detection (ADD) aims to detect spoof speech from bonafide speech. Most prior studies assume that stronger correlations within or across acoustic and emotional features imply authenticity, and thus focus on enhancing or…

Sound · Computer Science 2026-01-21 Jinhua Zhang, Zhenqi Jia, Rui Liu

Learning Audio Concepts from Counterfactual Natural Language

Conventional audio classification relied on predefined classes, lacking the ability to learn from free-form text. Recent methods unlock learning joint audio-text embeddings from raw audio-text pairs describing audio in natural language.…

Multimedia · Computer Science 2024-01-11 Ali Vosoughi, Luca Bondi, Ho-Hsiang Wu, Chenliang Xu

From Contrast to Commonality: Audio Commonality Captioning for Enhanced Audio-Text Cross-modal Understanding in Multimodal LLMs

Audio Captioning (AC) plays a pivotal role in enhancing audio-text cross-modal understanding during the pretraining and finetuning of Multimodal LLMs (MLLMs). To strengthen this alignment, recent works propose Audio Difference Captioning…

Sound · Computer Science 2026-01-27 Yuhang Jia, Xu Zhang, Yujie Guo, Yang Chen, Shiwan Zhao

RadDiff: Describing Differences in Radiology Image Sets with Natural Language

Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning…

Computer Vision and Pattern Recognition · Computer Science 2026-01-08 Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy

Exploration of Audio Quality Assessment and Anomaly Localisation Using Attention Models

Many applications of speech technology require more and more audio data. Automatic assessment of the quality of the collected recordings is important to ensure they meet the requirements of the related applications. However, effective and…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-19 Qiang Huang, Thomas Hain

ADx3: A Collaborative Workflow for High-Quality Accessible Audio Description

Audio description (AD) makes video content accessible to blind and low-vision (BLV) audiences, but producing high-quality descriptions is resource-intensive. Automated AD offers scalability, and prior studies show human-in-the-loop editing…

Human-Computer Interaction · Computer Science 2026-02-04 Lana Do, Shasta Ihorn, Charity Pitcher-Cooper, Juvenal Francisco Barajas, Gio Jung, Xuan Duy Anh Nguyen, Sanjay Mirani, Ilmi Yoon

DIFFA: Large Language Diffusion Models Can Listen and Understand

Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm,…

Sound · Computer Science 2025-11-11 Jiaming Zhou, Hongjie Chen, Shiwan Zhao, Jian Kang, Jie Li, Enzhi Wang, Yujie Guo, Haoqin Sun, Hui Wang, Aobo Kong, Yong Qin, Xuelong Li

Audio Entailment: Assessing Deductive Reasoning for Audio Understanding

Recent literature uses language to build foundation models for audio. These Audio-Language Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval,…

Sound · Computer Science 2024-07-26 Soham Deshmukh, Shuo Han, Hazim Bukhari, Benjamin Elizalde, Hannes Gamper, Rita Singh, Bhiksha Raj

ATIR: Towards Audio-Text Interleaved Contextual Retrieval

Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval…

Sound · Computer Science 2026-04-23 Tong Zhao, Chenghao Zhang, Yutao Zhu, Zhicheng Dou

AUDITA: A New Dataset to Audit Humans vs. AI Skill at Audio QA

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific…

Computation and Language · Computer Science 2026-04-24 Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, Jordan Lee Boyd-Graber

Unveiling and Mitigating Bias in Audio Visual Segmentation

Community researchers have developed a range of advanced audio-visual segmentation models aimed at improving the quality of sounding objects' masks. While masks created by these models may initially appear plausible, they occasionally…

Computer Vision and Pattern Recognition · Computer Science 2024-07-24 Peiwen Sun, Honggang Zhang, Di Hu

ALCAP: Alignment-Augmented Music Captioner

Music captioning has gained significant attention in the wake of the rising prominence of streaming media platforms. Traditional approaches often prioritize either the audio or lyrics aspect of the music, inadvertently ignoring the…

Sound · Computer Science 2023-10-24 Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved…

Sound · Computer Science 2024-08-01 Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha