Related papers: SemanticAC: Semantics-Assisted Framework for Audio…

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the…

Sound · Computer Science 2021-10-13 Zhongjie Ye, Helin Wang, Dongchao Yang, Yuexian Zou

CLAP-ART: Automated Audio Captioning with Semantic-rich Audio Representation Tokenizer

Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, an AAC method, EnCLAP, employed discrete…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened…

Sound · Computer Science 2024-06-26 Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Discrete Audio Representations for Automated Audio Captioning

Discrete audio representations, termed audio tokens, are broadly categorized into semantic and acoustic tokens, typically generated through unsupervised tokenization of continuous audio representations. However, their applicability to…

Sound · Computer Science 2025-05-22 Jingguang Tian, Haoqin Sun, Xinhui Hu, Xinkang Xu

CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation

With the rise of multimodal large language models (LLMs), audio codec plays an increasingly vital role in encoding audio into discrete tokens, enabling integration of audio into text-based LLMs. Current audio codec captures two types of…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-29 Ruifan Deng, Yitian Gong, Qinghui Gao, Luozhijie Jin, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu

Zero-Shot Audio Classification via Semantic Embeddings

In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio…

Audio and Speech Processing · Electrical Eng. & Systems 2021-02-12 Huang Xie, Tuomas Virtanen

SAC: Semantic Attention Composition for Text-Conditioned Image Retrieval

The ability to efficiently search for images is essential for improving the user experiences across various products. Incorporating user feedback, via multi-modal inputs, to navigate visual search can help tailor retrieved results to…

Computer Vision and Pattern Recognition · Computer Science 2021-10-22 Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy

ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds

Automated Audio Captioning is a multimodal task that aims to convert audio content into natural language. The assessment of audio captioning systems is typically based on quantitative metrics applied to text data. Previous studies have…

Sound · Computer Science 2024-03-28 Gijs Wijngaard, Elia Formisano, Bruno L. Giordano, Michel Dumontier

ECHO: Environmental Sound Classification with Hierarchical Ontology-guided Semi-Supervised Learning

Environment Sound Classification has been a well-studied research problem in the field of signal processing and up till now more focus has been laid on fully supervised approaches. Over the last few years, focus has moved towards…

Sound · Computer Science 2024-09-24 Pranav Gupta, Raunak Sharma, Rashmi Kumari, Sri Krishna Aditya, Shwetank Choudhary, Sumit Kumar, Kanchana M, Thilagavathy R

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs…

Sound · Computer Science 2024-12-02 Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

Automated Audio Captioning using Audio Event Clues

Audio captioning is an important research area that aims to generate meaningful descriptions for audio clips. Most of the existing research extracts acoustic features of audio clips as input to encoder-decoder and transformer architectures…

Sound · Computer Science 2022-04-20 Ayşegül Özkaya Eren, Mustafa Sert

Investigating Local and Global Information for Automated Audio Captioning with Transfer Learning

Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic…

Sound · Computer Science 2021-02-24 Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Zeyu Xie, Kai Yu

Leveraging Self-supervised Audio Representations for Data-Efficient Acoustic Scene Classification

Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a…

Sound · Computer Science 2024-08-28 Yiqiang Cai, Shengchen Li, Xi Shao

SpeechMLC: Speech Multi-label Classification

In this paper, we propose a multi-label classification framework to detect multiple speaking styles in a speech sample. Unlike previous studies that have primarily focused on identifying a single target style, our framework effectively…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-19 Miseul Kim, Seyun Um, Hyeonjin Cha, Hong-goo Kang

Towards Deep Modeling of Music Semantics using EEG Regularizers

Modeling of music audio semantics has been previously tackled through learning of mappings from audio data to high-level tags or latent unsupervised spaces. The resulting semantic spaces are theoretically limited, either because the chosen…

Information Retrieval · Computer Science 2017-12-18 Francisco Raposo, David Martins de Matos, Ricardo Ribeiro, Suhua Tang, Yi Yu

Synth-AC: Enhancing Audio Captioning with Synthetic Supervision

Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes a SynthAC framework, which…

Sound · Computer Science 2023-09-19 Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Xubo Liu, Haohe Liu, Kejia Zhang, Wenwu Wang

A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

This paper presents a simple method that allows to easily enhance textual pre-trained large language models with speech information, when fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings…

Computation and Language · Computer Science 2026-04-07 Nicolas Calbucura, Jose Guillen, Valentin Barriere

Improving Audio Captioning Using Semantic Similarity Metrics

Audio captioning quality metrics which are typically borrowed from the machine translation and image captioning areas measure the degree of overlap between predicted tokens and gold reference tokens. In this work, we consider a metric…

Multimedia · Computer Science 2023-03-06 Rehana Mahfuz, Yinyi Guo, Erik Visser

Semantic Codebooks as Effective Priors for Neural Speech Compression

Speech codecs are traditionally optimized for waveform fidelity, allocating bits to preserve acoustic detail even when much of it can be inferred from linguistic structure. This leads to inefficient compression and suboptimal performance on…

Sound · Computer Science 2025-12-29 Liuyang Bai, Weiyi Lu, Li Guo

AudioCodecBench: A Comprehensive Benchmark for Audio Codec Evaluation

Multimodal Large Language Models (MLLMs) have been widely applied in speech and music. This tendency has led to a focus on audio tokenization for Large Models (LMs). Unlike semantic-only text tokens, audio tokens must both capture global…

Sound · Computer Science 2025-09-05 Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang, Haodi Zhang