Related papers: SemanticAC: Semantics-Assisted Framework for Audio…

200 papers

Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the…

Sound · Computer Science 2021-10-13 Zhongjie Ye, Helin Wang, Dongchao Yang, Yuexian Zou

Automated Audio Captioning (AAC) aims to describe the semantic contexts of general sounds, including acoustic events and scenes, by leveraging effective acoustic features. To enhance performance, an AAC method, EnCLAP, employed discrete…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-03 Daiki Takeuchi, Binh Thien Nguyen, Masahiro Yasuda, Yasunori Ohishi, Daisuke Niizumi, Noboru Harada

Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened…

Sound · Computer Science 2024-06-26 Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Discrete audio representations, termed audio tokens, are broadly categorized into semantic and acoustic tokens, typically generated through unsupervised tokenization of continuous audio representations. However, their applicability to…

Sound · Computer Science 2025-05-22 Jingguang Tian, Haoqin Sun, Xinhui Hu, Xinkang Xu

With the rise of multimodal large language models (LLMs), audio codec plays an increasingly vital role in encoding audio into discrete tokens, enabling integration of audio into text-based LLMs. Current audio codec captures two types of…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-29 Ruifan Deng, Yitian Gong, Qinghui Gao, Luozhijie Jin, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu

In this paper, we study zero-shot learning in audio classification via semantic embeddings extracted from textual labels and sentence descriptions of sound classes. Our goal is to obtain a classifier that is capable of recognizing audio…

Audio and Speech Processing · Electrical Eng. & Systems 2021-02-12 Huang Xie, Tuomas Virtanen

The ability to efficiently search for images is essential for improving the user experiences across various products. Incorporating user feedback, via multi-modal inputs, to navigate visual search can help tailor retrieved results to…

Computer Vision and Pattern Recognition · Computer Science 2021-10-22 Surgan Jandial, Pinkesh Badjatiya, Pranit Chawla, Ayush Chopra, Mausoom Sarkar, Balaji Krishnamurthy

Automated Audio Captioning is a multimodal task that aims to convert audio content into natural language. The assessment of audio captioning systems is typically based on quantitative metrics applied to text data. Previous studies have…

Sound · Computer Science 2024-03-28 Gijs Wijngaard, Elia Formisano, Bruno L. Giordano, Michel Dumontier

Environment Sound Classification has been a well-studied research problem in the field of signal processing, and until now more focus has been placed on fully supervised approaches. Over the last few years, focus has moved towards…

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs…

Sound · Computer Science 2024-12-02 Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

Audio captioning is an important research area that aims to generate meaningful descriptions for audio clips. Most of the existing research extracts acoustic features of audio clips as input to encoder-decoder and transformer architectures…

Sound · Computer Science 2022-04-20 Ayşegül Özkaya Eren, Mustafa Sert

Automated audio captioning (AAC) aims at generating summarizing descriptions for audio clips. Multitudinous concepts are described in an audio caption, ranging from local information such as sound events to global information like acoustic…

Sound · Computer Science 2021-02-24 Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Zeyu Xie, Kai Yu

Acoustic scene classification (ASC) predominantly relies on supervised approaches. However, acquiring labeled data for training ASC models is often costly and time-consuming. Recently, self-supervised learning (SSL) has emerged as a…

Sound · Computer Science 2024-08-28 Yiqiang Cai, Shengchen Li, Xi Shao

In this paper, we propose a multi-label classification framework to detect multiple speaking styles in a speech sample. Unlike previous studies that have primarily focused on identifying a single target style, our framework effectively…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-19 Miseul Kim, Seyun Um, Hyeonjin Cha, Hong-goo Kang

Modeling of music audio semantics has been previously tackled through learning of mappings from audio data to high-level tags or latent unsupervised spaces. The resulting semantic spaces are theoretically limited, either because the chosen…

Information Retrieval · Computer Science 2017-12-18 Francisco Raposo, David Martins de Matos, Ricardo Ribeiro, Suhua Tang, Yi Yu

Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes a SynthAC framework, which…

Sound · Computer Science 2023-09-19 Feiyang Xiao, Qiaoxi Zhu, Jian Guan, Xubo Liu, Haohe Liu, Kejia Zhang, Wenwu Wang

This paper presents a simple method for easily enhancing textual pre-trained large language models with speech information when they are fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings…

Computation and Language · Computer Science 2026-04-07 Nicolas Calbucura, Jose Guillen, Valentin Barriere

Audio captioning quality metrics, which are typically borrowed from the machine translation and image captioning areas, measure the degree of overlap between predicted tokens and gold reference tokens. In this work, we consider a metric…

Multimedia · Computer Science 2023-03-06 Rehana Mahfuz, Yinyi Guo, Erik Visser

Speech codecs are traditionally optimized for waveform fidelity, allocating bits to preserve acoustic detail even when much of it can be inferred from linguistic structure. This leads to inefficient compression and suboptimal performance on…

Sound · Computer Science 2025-12-29 Liuyang Bai, Weiyi Lu, Li Guo

Multimodal Large Language Models (MLLMs) have been widely applied in speech and music. This tendency has led to a focus on audio tokenization for Large Models (LMs). Unlike semantic-only text tokens, audio tokens must both capture global…

Sound · Computer Science 2025-09-05 Lu Wang, Hao Chen, Siyu Wu, Zhiyue Wu, Hao Zhou, Chengfeng Zhang, Ting Wang, Haodi Zhang