Related papers: Caption Feature Space Regularization for Audio Cap…

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-04 Sandeep Kothinti, Dimitra Emmanouilidou

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

The goal of audio captioning is to translate input audio into its description using natural language. One of the problems in audio captioning is the lack of training data due to the difficulty in collecting audio-caption pairs by crawling…

Audio and Speech Processing · Electrical Eng. & Systems 2020-12-15 Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention

Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds. How to accurately recognize ambiguous sounds is a major challenge for audio captioning. In this work, inspired by…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-30 Xubo Liu, Qiushi Huang, Xinhao Mei, Haohe Liu, Qiuqiang Kong, Jianyuan Sun, Shengchen Li, Tom Ko, Yu Zhang, Lilian H. Tang, Mark D. Plumbley, Volkan Kılıç, Wenwu Wang

Classifier-Guided Captioning Across Modalities

Most current captioning systems use language models trained on data from specific settings, such as image-based captioning via Amazon Mechanical Turk, limiting their ability to generalize to other modality distributions and contexts. This…

Computation and Language · Computer Science 2025-01-07 Ariel Shaulov, Tal Shaharabany, Eitan Shaar, Gal Chechik, Lior Wolf

Diverse Audio Captioning via Adversarial Training

Audio captioning aims at generating natural language descriptions for audio clips automatically. Existing audio captioning models have shown promising improvement in recent years. However, these models are mostly trained via maximum…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-30 Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

Multi-task Regularization Based on Infrequent Classes for Audio Captioning

Audio captioning is a multi-modal task, focusing on using natural language for describing the contents of general audio. Most audio captioning methods are based on deep neural networks, employing an encoder-decoder scheme and a dataset with…

Sound · Computer Science 2020-07-10 Emre Çakır, Konstantinos Drossos, Tuomas Virtanen

Audio Difference Learning for Audio Captioning

This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-18 Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

Zero-Shot Audio Captioning via Audibility Guidance

The task of audio captioning is similar in essence to tasks such as image and video captioning. However, it has received much less attention. We propose three desiderata for captioning audio -- (i) fluency of the generated text, (ii)…

Sound · Computer Science 2023-09-08 Tal Shaharabany, Ariel Shaulov, Lior Wolf

Zero-Shot Audio Captioning Using Soft and Hard Prompts

In traditional audio captioning methods, a model is usually trained in a fully supervised manner using a human-annotated dataset containing audio-text pairs and then evaluated on the test sets from the same dataset. Such methods have two…

Sound · Computer Science 2024-06-11 Yiming Zhang, Xuenan Xu, Ruoyi Du, Haohe Liu, Yuan Dong, Zheng-Hua Tan, Wenwu Wang, Zhanyu Ma

Improving Audio Captioning Using Semantic Similarity Metrics

Audio captioning quality metrics, which are typically borrowed from the machine translation and image captioning areas, measure the degree of overlap between predicted tokens and gold reference tokens. In this work, we consider a metric…

Multimedia · Computer Science 2023-03-06 Rehana Mahfuz, Yinyi Guo, Erik Visser

Crowdsourcing a Dataset of Audio Captions

Audio captioning is a novel field of multi-modal translation and it is the task of creating a textual description of the content of an audio signal (e.g. "people talking in a big room"). The creation of a dataset for this task requires a…

Sound · Computer Science 2019-07-23 Samuel Lipping, Konstantinos Drossos, Tuomas Virtanen

Prefix tuning for automated audio captioning

Audio captioning aims to generate text descriptions from environmental sounds. One challenge of audio captioning is the difficulty of the generalization due to the lack of audio-text paired training data. In this work, we propose a simple…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-05 Minkyu Kim, Kim Sung-Bin, Tae-Hyun Oh

ALCAP: Alignment-Augmented Music Captioner

Music captioning has gained significant attention in the wake of the rising prominence of streaming media platforms. Traditional approaches often prioritize either the audio or lyrics aspect of the music, inadvertently ignoring the…

Sound · Computer Science 2023-10-24 Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song

Towards Generating Diverse Audio Captions via Adversarial Training

Automated audio captioning is a cross-modal translation task for describing the content of audio clips with natural language sentences. This task has attracted increasing attention and substantial progress has been made in recent years.…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-02 Xinhao Mei, Xubo Liu, Jianyuan Sun, Mark D. Plumbley, Wenwu Wang

Automated Audio Captioning using Audio Event Clues

Audio captioning is an important research area that aims to generate meaningful descriptions for audio clips. Most of the existing research extracts acoustic features of audio clips as input to encoder-decoder and transformer architectures…

Sound · Computer Science 2022-04-20 Ayşegül Özkaya Eren, Mustafa Sert

Exploring the Role of Audio in Video Captioning

Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches…

Computer Vision and Pattern Recognition · Computer Science 2023-06-23 Yuhan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar, Heng Wang

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input…

Audio and Speech Processing · Electrical Eng. & Systems 2020-07-08 Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen

Imagine How To Change: Explicit Procedure Modeling for Change Captioning

Change captioning generates descriptions that explicitly describe the differences between two visually similar images. Existing methods operate on static image pairs, thus ignoring the rich temporal dynamics of the change procedure, which…

Computer Vision and Pattern Recognition · Computer Science 2026-03-09 Jiayang Sun, Zixin Guo, Min Cao, Guibo Zhu, Jorma Laaksonen

Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation

Audio-language pretraining holds promise for general-purpose audio understanding, yet remains underexplored compared to its vision counterpart. While vision-language models like CLIP serve as widely adopted foundations, existing…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-24 Wei-Cheng Tseng, Xuanru Zhou, Mingyue Huo, Yiwen Shao, Hao Zhang, Dong Yu

Automated Audio Captioning: An Overview of Recent Progress and New Challenges

Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent…

Audio and Speech Processing · Electrical Eng. & Systems 2022-09-28 Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang