Related papers: ADIFF: Explaining audio difference using natural l…

200 papers

This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-18 Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that…

Audio and Speech Processing · Electrical Eng. & Systems 2023-08-24 Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-04 Sandeep Kothinti, Dimitra Emmanouilidou

Several types of spoofed audio, such as mimicry, replay attacks, and deepfakes, have created societal challenges to information integrity. Recently, researchers have worked with sociolinguistics experts to label spoofed audio samples with…

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio, different people may perceive the same audio differently, resulting in caption disparities (i.e., one audio may correlate to…

Sound · Computer Science 2022-04-19 Yiming Zhang, Hong Yu, Ruoyi Du, Zhanyu Ma, Yuan Dong

Audio Descriptions (ADs) aim to provide a narration of a movie in text form, describing non-dialogue-related narratives, such as characters, actions, or scene establishment. Automatic generation of ADs remains challenging due to: i) the…

Computer Vision and Pattern Recognition · Computer Science 2024-11-28 Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan

Recent Audio Multimodal Large Language Models (Audio MLLMs) demonstrate impressive performance on speech benchmarks, yet it remains unclear whether these models genuinely process acoustic signals or rely on text-based semantic inference. To…

Artificial Intelligence · Computer Science 2026-03-23 Jiaqi Xiong, Yunjia Qi, Qi Cao, Yu Zheng, Yutong Zhang, Ziteng Wang, Ruofan Liao, Weisheng Xu, Sichen Liu

Audio Deepfake Detection (ADD) aims to detect spoof speech from bonafide speech. Most prior studies assume that stronger correlations within or across acoustic and emotional features imply authenticity, and thus focus on enhancing or…

Sound · Computer Science 2026-01-21 Jinhua Zhang, Zhenqi Jia, Rui Liu

Conventional audio classification relied on predefined classes, lacking the ability to learn from free-form text. Recent methods unlock learning joint audio-text embeddings from raw audio-text pairs describing audio in natural language.…

Multimedia · Computer Science 2024-01-11 Ali Vosoughi, Luca Bondi, Ho-Hsiang Wu, Chenliang Xu

Audio Captioning (AC) plays a pivotal role in enhancing audio-text cross-modal understanding during the pretraining and finetuning of Multimodal LLMs (MLLMs). To strengthen this alignment, recent works propose Audio Difference Captioning…

Sound · Computer Science 2026-01-27 Yuhang Jia, Xu Zhang, Yujie Guo, Yang Chen, Shiwan Zhao

Understanding how two radiology image sets differ is critical for generating clinical insights and for interpreting medical AI systems. We introduce RadDiff, a multimodal agentic system that performs radiologist-style comparative reasoning…

Computer Vision and Pattern Recognition · Computer Science 2026-01-08 Xiaoxian Shen, Yuhui Zhang, Sahithi Ankireddy, Xiaohan Wang, Maya Varma, Henry Guo, Curtis Langlotz, Serena Yeung-Levy

Many applications of speech technology require more and more audio data. Automatic assessment of the quality of the collected recordings is important to ensure they meet the requirements of the related applications. However, effective and…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-19 Qiang Huang, Thomas Hain

Audio description (AD) makes video content accessible to blind and low-vision (BLV) audiences, but producing high-quality descriptions is resource-intensive. Automated AD offers scalability, and prior studies show human-in-the-loop editing…

Human-Computer Interaction · Computer Science 2026-02-04 Lana Do, Shasta Ihorn, Charity Pitcher-Cooper, Juvenal Francisco Barajas, Gio Jung, Xuan Duy Anh Nguyen, Sanjay Mirani, Ilmi Yoon

Recent advances in large language models (LLMs) have shown remarkable capabilities across textual and multimodal domains. In parallel, diffusion-based language models have emerged as a promising alternative to the autoregressive paradigm,…

Recent literature uses language to build foundation models for audio. These Audio-Language Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval,…

Audio carries richer information than text, including emotion, speaker traits, and environmental context, while also enabling lower-latency processing compared to speech-to-text pipelines. However, recent multimodal information retrieval…

Sound · Computer Science 2026-04-23 Tong Zhao, Chenghao Zhang, Yutao Zhu, Zhicheng Dou

Existing audio question answering benchmarks largely emphasize sound event classification or caption-grounded queries, often enabling models to succeed through shortcut strategies, short-duration cues, lexical priors, dataset-specific…

Computation and Language · Computer Science 2026-04-24 Tasnim Kabir, Dmytro Kurdydyk, Aadi Palnitkar, Liam Dorn, Ahmed Haj Ahmed, Jordan Lee Boyd-Graber

Community researchers have developed a range of advanced audio-visual segmentation models aimed at improving the quality of sounding objects' masks. While masks created by these models may initially appear plausible, they occasionally…

Computer Vision and Pattern Recognition · Computer Science 2024-07-24 Peiwen Sun, Honggang Zhang, Di Hu

Music captioning has gained significant attention in the wake of the rising prominence of streaming media platforms. Traditional approaches often prioritize either the audio or lyrics aspect of the music, inadvertently ignoring the…

Sound · Computer Science 2023-10-24 Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved…
