Related papers: Multilingual Audio Captioning using machine transl…

Weakly-supervised Automated Audio Captioning via text only training

In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to…

Sound · Computer Science 2023-09-22 Theodoros Kouzelis, Vassilis Katsouros

Training Audio Captioning Models without Audio

Automated Audio Captioning (AAC) is the task of generating natural language descriptions given an audio stream. A typical AAC system requires manually curated training data of audio segments and corresponding text caption annotations. The…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-15 Soham Deshmukh, Benjamin Elizalde, Dimitra Emmanouilidou, Bhiksha Raj, Rita Singh, Huaming Wang

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the…

Sound · Computer Science 2021-10-13 Zhongjie Ye, Helin Wang, Dongchao Yang, Yuexian Zou

Is my automatic audio captioning system so bad? spider-max: a metric to consider several caption candidates

Automatic Audio Captioning (AAC) is the task that aims to describe an audio signal using natural language. AAC systems take as input an audio signal and output a free-form text sentence, called a caption. Evaluating such systems is not…

Sound · Computer Science 2022-11-17 Etienne Labbé, Thomas Pellegrini, Julien Pinquier

Killing two birds with one stone: Can an audio captioning system also be used for audio-text retrieval?

Automated Audio Captioning (AAC) aims to develop systems capable of describing an audio recording using a textual sentence. In contrast, Audio-Text Retrieval (ATR) systems seek to find the best matching audio recording(s) for a given…

Computation and Language · Computer Science 2023-08-30 Etienne Labbé, Thomas Pellegrini, Julien Pinquier

Evaluating Off-the-Shelf Machine Listening and Natural Language Models for Automated Audio Captioning

Automated audio captioning (AAC) is the task of automatically generating textual descriptions for general audio signals. A captioning system has to identify various information from the input signal and express it with natural language.…

Machine Learning · Computer Science 2021-10-15 Benno Weck, Xavier Favory, Konstantinos Drossos, Xavier Serra

Interactive Audio-text Representation for Automated Audio Captioning with Contrastive Learning

Automated Audio captioning (AAC) is a cross-modal task that generates natural language to describe the content of input audio. Most prior works usually extract single-modality acoustic features and are therefore sub-optimal for the…

Sound · Computer Science 2022-04-13 Chen Chen, Nana Hou, Yuchen Hu, Heqing Zou, Xiaofeng Qi, Eng Siong Chng

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened…

Sound · Computer Science 2024-06-26 Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Clotho: An Audio Captioning Dataset

Audio captioning is the novel task of general audio content description using free text. It is an intermodal translation task (not speech-to-text), where a system accepts as an input an audio signal and outputs the textual description (i.e.…

Sound · Computer Science 2019-10-22 Konstantinos Drossos, Samuel Lipping, Tuomas Virtanen

Auto-ACD: A Large-scale Dataset for Audio-Language Representation Learning

Recently, the AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, for audio representation learning, existing datasets suffer from limitations in the…

Sound · Computer Science 2024-09-10 Luoyi Sun, Xuenan Xu, Mengyue Wu, Weidi Xie

SLAM-AAC: Enhancing Audio Captioning with Paraphrasing Augmentation and CLAP-Refine through LLMs

Automated Audio Captioning (AAC) aims to generate natural textual descriptions for input audio signals. Recent progress in audio pre-trained models and large language models (LLMs) has significantly enhanced audio understanding and textual…

Audio and Speech Processing · Electrical Eng. & Systems 2024-10-15 Wenxi Chen, Ziyang Ma, Xiquan Li, Xuenan Xu, Yuzhe Liang, Zhisheng Zheng, Kai Yu, Xie Chen

Improving Audio Caption Fluency with Automatic Error Correction

Automated audio captioning (AAC) is an important cross-modality translation task, aiming at generating descriptions for audio clips. However, captions generated by previous AAC models have faced "false-repetition" errors due to the…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-21 Hanxue Zhang, Zeyu Xie, Xuenan Xu, Mengyue Wu, Kai Yu

CL4AC: A Contrastive Loss for Audio Captioning

Automated Audio captioning (AAC) is a cross-modal translation task that aims to use natural language to describe the content of an audio clip. As shown in the submissions received for Task 6 of the DCASE 2021 Challenges, this problem has…

Audio and Speech Processing · Electrical Eng. & Systems 2021-11-23 Xubo Liu, Qiushi Huang, Xinhao Mei, Tom Ko, H Lilian Tang, Mark D. Plumbley, Wenwu Wang

Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic…

Sound · Computer Science 2021-02-24 Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Zeyu Xie, Kai Yu

Continual Learning for Automated Audio Captioning Using The Learning Without Forgetting Approach

Automated audio captioning (AAC) is the task of automatically creating textual descriptions (i.e. captions) for the contents of a general audio signal. Most AAC methods are using existing datasets to optimize and/or evaluate upon. Given the…

Sound · Computer Science 2021-07-19 Jan Berg, Konstantinos Drossos

CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires considering diverse factors, among them,…

Computation and Language · Computer Science 2025-08-12 Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan

RECAP: Retrieval-Augmented Audio Captioning

We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally,…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-07 Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, Dinesh Manocha

CoNeTTE: An efficient Audio Captioning system leveraging multiple datasets with Task Embedding

Automated Audio Captioning (AAC) involves generating natural language descriptions of audio content, using encoder-decoder architectures. An audio encoder produces audio embeddings fed to a decoder, usually a Transformer decoder, for…

Sound · Computer Science 2023-09-04 Étienne Labbé, Thomas Pellegrini, Julien Pinquier

ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds

Automated Audio Captioning is a multimodal task that aims to convert audio content into natural language. The assessment of audio captioning systems is typically based on quantitative metrics applied to text data. Previous studies have…

Sound · Computer Science 2024-03-28 Gijs Wijngaard, Elia Formisano, Bruno L. Giordano, Michel Dumontier

MACE: Leveraging Audio for Evaluating Audio Captioning Systems

The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal…

Sound · Computer Science 2024-11-06 Satvik Dixit, Soham Deshmukh, Bhiksha Raj