Related papers: Audio Difference Learning for Audio Captioning

Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement

We proposed Audio Difference Captioning (ADC) as a new extension task of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. The ADC solves the problem that…

Audio and Speech Processing · Electrical Eng. & Systems 2023-08-24 Daiki Takeuchi, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada, Kunio Kashino

Automated Audio Captioning using Audio Event Clues

Audio captioning is an important research area that aims to generate meaningful descriptions for audio clips. Most of the existing research extracts acoustic features of audio clips as input to encoder-decoder and transformer architectures…

Sound · Computer Science 2022-04-20 Ayşegül Özkaya Eren, Mustafa Sert

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with…

Sound · Computer Science 2020-07-10 Emre Çakır, Konstantinos Drossos, Tuomas Virtanen

Caption Feature Space Regularization for Audio Captioning

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio, different people may perceive the same audio differently, resulting in caption disparities (i.e., one audio may correlate to…

Sound · Computer Science 2022-04-19 Yiming Zhang, Hong Yu, Ruoyi Du, Zhanyu Ma, Yuan Dong

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-30 Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-22 Paul Primus, Florian Schmid, Gerhard Widmer

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features…

Audio and Speech Processing · Electrical Eng. & Systems 2021-08-06 Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

Towards Generating Diverse Audio Captions via Adversarial Training

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years.…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-02 Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

Prefix tuning for automated audio captioning

Audio captioning aims to generate text descriptions from environmental sounds. One challenge of audio captioning is the difficulty of the generalization due to the lack of audio-text paired training data. In this work, we propose a simple…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-05 Minkyu Kim, Kim Sung-Bin, Tae-Hyun Oh

Multi-Representation Knowledge Distillation For Audio Classification

As an important component of multimedia analysis tasks, audio classification aims to discriminate between different audio signal types and has received intensive attention due to its wide applications. Generally speaking, the raw signal can…

Multimedia · Computer Science 2020-02-25 Liang Gao, Kele Xu, Huaimin Wang, Yuxing Peng

Zero-Shot Audio Captioning Using Soft and Hard Prompts

In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two…

Sound · Computer Science 2024-06-11 Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

The goal of audio captioning is to translate input audio into its description using natural language. One of the problems in audio captioning is the lack of training data due to the difficulty in collecting audio-caption pairs by crawling…

Audio and Speech Processing · Electrical Eng. & Systems 2020-12-15 Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation

Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-24 Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo, Yiwen Shao, Hao Zhang, Dong Yu

Zero-Shot Audio Captioning via Audibility Guidance

The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii)…

Sound · Computer Science 2023-09-08 Tal Shaharabany, Ariel Shaulov, Lior Wolf

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the…

Sound · Computer Science 2021-10-13 Zhongjie Ye, Helin Wang, Dongchao Yang, Yuexian Zou

Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach

Automated audio captioning (AAC) is the task of automatically creating textual descriptions (i.e. captions) for the contents of a general audio signal. Most AAC methods are using existing datasets to optimize and/or evaluate upon. Given the…

Sound · Computer Science 2021-07-19 Jan Berg, Konstantinos Drossos

Exploring the Role of Audio in Video Captioning

Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches…

Computer Vision and Pattern Recognition · Computer Science 2023-06-23 Yuhan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar, Heng Wang

WaveTransformer: A Novel Architecture for Audio Captioning Based on Learning Temporal and Time-Frequency Information

Automated audio captioning (AAC) is a novel task, where a method takes as an input an audio sample and outputs a textual description (i.e. a caption) of its contents. Most AAC methods are adapted from image captioning of machine…

Sound · Computer Science 2020-10-22 An Tran, Konstantinos Drossos, Tuomas Virtanen

Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning

Automated audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio. Most prior works usually extract single-modality acoustic features and are therefore sub-optimal for the…

Sound · Computer Science 2022-04-13 Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-04 Sandeep Kothinti, Dimitra Emmanouilidou