Related papers: Multi-task Regularization Based on Infrequent Clas…

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-04 Sandeep Kothinti, Dimitra Emmanouilidou

Multitask learning in Audio Captioning: a sentence embedding regression loss acts as a regularizer

In this work, we propose to study the performance of a model trained with a sentence embedding regression loss component for the Automated Audio Captioning task. This task aims to build systems that can describe audio content with a single…

Sound · Computer Science 2023-05-03 Etienne Labbé, Julien Pinquier, Thomas Pellegrini

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features…

Audio and Speech Processing · Electrical Eng. & Systems 2021-08-06 Xinhao Mei, Qiushi Huang, Xubo Liu, Gengyun Chen, Jingqian Wu, Yusong Wu, Jinzheng Zhao, Shengchen Li, Tom Ko, H Lilian Tang, Xi Shao, Mark D. Plumbley, Wenwu Wang

Audio Difference Learning for Audio Captioning

This study introduces a novel training paradigm, audio difference learning, for improving audio captioning. The fundamental concept of the proposed learning method is to create a feature representation space that preserves the relationship…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-18 Tatsuya Komatsu, Yusuke Fujita, Kazuya Takeda, Tomoki Toda

Caption Feature Space Regularization for Audio Captioning

Audio captioning aims at describing the content of audio clips with human language. Due to the ambiguity of audio, different people may perceive the same audio differently, resulting in caption disparities (i.e., one audio may correlate to…

Sound · Computer Science 2022-04-19 Yiming Zhang, Hong Yu, Ruoyi Du, Zhanyu Ma, Yuan Dong

Automated Audio Captioning: An Overview of Recent Progress and New Challenges

Automated audio captioning is a cross-modal translation task that aims to generate natural language descriptions for given audio clips. This task has received increasing attention with the release of freely available datasets in recent…

Audio and Speech Processing · Electrical Eng. & Systems 2022-09-28 Xinhao Mei, Xubo Liu, Mark D. Plumbley, Wenwu Wang

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the…

Sound · Computer Science 2021-10-13 Zhongjie Ye, Helin Wang, Dongchao Yang, Yuexian Zou

Automated Audio Captioning using Audio Event Clues

Audio captioning is an important research area that aims to generate meaningful descriptions for audio clips. Most of the existing research extracts acoustic features of audio clips as input to encoder-decoder and transformer architectures…

Sound · Computer Science 2022-04-20 Ayşegül Özkaya Eren, Mustafa Sert

Estimated Audio-Caption Correspondences Improve Language-Based Audio Retrieval

Dual-encoder-based audio retrieval systems are commonly optimized with contrastive learning on a set of matching and mismatching audio-caption pairs. This leads to a shared embedding space in which corresponding items from the two…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-22 Paul Primus, Florian Schmid, Gerhard Widmer

MusCaps: Generating Captions for Music Audio

Content-based music information retrieval has seen rapid progress with the adoption of deep learning. Current approaches to high-level music description typically make use of classification models, such as in auto-tagging or genre and mood…

Sound · Computer Science 2021-12-09 Ilaria Manco, Emmanouil Benetos, Elio Quinton, Gyorgy Fazekas

Prefix tuning for automated audio captioning

Audio captioning aims to generate text descriptions from environmental sounds. One challenge of audio captioning is the difficulty of the generalization due to the lack of audio-text paired training data. In this work, we propose a simple…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-05 Minkyu Kim, Kim Sung-Bin, Tae-Hyun Oh

Retrieval-Augmented Approach for Unsupervised Anomalous Sound Detection and Captioning without Model Training

This paper proposes a method for unsupervised anomalous sound detection (UASD) and captioning the reason for detection. While there is a method that captions the difference between given normal and anomalous sound pairs, it is assumed to be…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-30 Ryoya Ogura, Tomoya Nishida, Yohei Kawaguchi

Clotho: An Audio Captioning Dataset

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e.…

Sound · Computer Science 2019-10-22 Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen

SpeechCaps: Advancing Instruction-Based Universal Speech Models with Multi-Talker Speaking Style Captioning

Instruction-based speech processing is becoming popular. Studies show that training with multiple tasks boosts performance, but collecting diverse, large-scale tasks and datasets is expensive. Thus, it is highly desirable to design a…

Computation and Language · Computer Science 2024-08-27 Chien-yu Huang, Min-Han Shih, Ke-Han Lu, Chi-Yuan Hsiao, Hung-yi Lee

Audio Captioning using Pre-Trained Large-Scale Language Model Guided by Audio-based Similar Caption Retrieval

The goal of audio captioning is to translate input audio into its description using natural language. One of the problems in audio captioning is the lack of training data due to the difficulty in collecting audio-caption pairs by crawling…

Audio and Speech Processing · Electrical Eng. & Systems 2020-12-15 Yuma Koizumi, Yasunori Ohishi, Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda

An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment

Multimodal large language models have fueled progress in image captioning. These models, fine-tuned on vast image datasets, exhibit a deep understanding of semantic concepts. In this work, we show that this ability can be re-purposed for…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-10 Hugo Malard, Michel Olvera, Stéphane Lathuiliere, Slim Essid

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened…

Sound · Computer Science 2024-06-26 Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

ALCAP: Alignment-Augmented Music Captioner

Music captioning has gained significant attention in the wake of the rising prominence of streaming media platforms. Traditional approaches often prioritize either the audio or lyrics aspect of the music, inadvertently ignoring the…

Sound · Computer Science 2023-10-24 Zihao He, Weituo Hao, Wei-Tsung Lu, Changyou Chen, Kristina Lerman, Xuchen Song

Automated Audio Captioning with Recurrent Neural Networks

We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the…

Sound · Computer Science 2017-10-25 Konstantinos Drossos, Sharath Adavanne, Tuomas Virtanen

Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input…

Audio and Speech Processing · Electrical Eng. & Systems 2020-07-08 Khoa Nguyen, Konstantinos Drossos, Tuomas Virtanen