Related papers: libACA, pyACA, and ACA-Code: Audio Content Analysi…

Audio Content Analysis

Preprint for a book chapter introducing Audio Content Analysis. With a focus on Music Information Retrieval systems, this chapter defines musical audio content, introduces the general process of audio content analysis, and surveys basic…

Audio and Speech Processing · Electrical Eng. & Systems 2021-01-05 Alexander Lerch

Digital Audio Processing Tools for Music Corpus Studies

Digital audio processing tools offer music researchers the opportunity to examine both non-notated music and music as performance. This chapter summarises the types of information that can be extracted from audio as well as currently…

Sound · Computer Science 2021-11-10 Johanna Devaney

CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation

With the rise of multimodal large language models (LLMs), audio codecs play an increasingly vital role in encoding audio into discrete tokens, enabling integration of audio into text-based LLMs. Current audio codecs capture two types of…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-29 Ruifan Deng, Yitian Gong, Qinghui Gao, Luozhijie Jin, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu

Pyroomacoustics: A Python package for audio room simulations and array processing algorithms

We present pyroomacoustics, a software package aimed at the rapid development and testing of audio array processing algorithms. The content of the package can be divided into three main components: an intuitive Python object-oriented…

Sound · Computer Science 2019-05-08 Robin Scheibler, Eric Bezzam, Ivan Dokmanić

LLM-Assisted Content Analysis: Using Large Language Models to Support Deductive Coding

Deductive coding is a widely used qualitative research method for determining the prevalence of themes across documents. While useful, deductive coding is often burdensome and time consuming since it requires researchers to read, interpret,…

Computation and Language · Computer Science 2023-06-28 Robert Chew, John Bollenbacher, Michael Wenger, Jessica Speer, Annice Kim

Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding

Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened…

Sound · Computer Science 2024-06-26 Jizhong Liu, Gang Li, Junbo Zhang, Heinrich Dinkel, Yongqing Wang, Zhiyong Yan, Yujun Wang, Bin Wang

Multi-agent Auditory Scene Analysis

Auditory scene analysis (ASA) aims to retrieve information from the acoustic environment, by carrying out three main tasks: sound source location, separation, and classification. These tasks are traditionally executed with a linear data…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-21 Caleb Rascon, Luis Gato-Diaz, Eduardo García-Alarcón

CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP) that learns a shared representation between audio and language modalities have improved…

Sound · Computer Science 2024-08-01 Sreyan Ghosh, Ashish Seth, Sonal Kumar, Utkarsh Tyagi, Chandra Kiran Evuru, S. Ramaneswaran, S. Sakshi, Oriol Nieto, Ramani Duraiswami, Dinesh Manocha

CLaMP 2: Multimodal Music Information Retrieval Across 101 Languages Using Large Language Models

Challenges in managing linguistic diversity and integrating various musical modalities are faced by current music information retrieval systems. These limitations reduce their effectiveness in a global, multimodal music environment. To…

Sound · Computer Science 2025-01-27 Shangda Wu, Yashan Wang, Ruibin Yuan, Zhancheng Guo, Xu Tan, Ge Zhang, Monan Zhou, Jing Chen, Xuefeng Mu, Yuejie Gao, Yuanliang Dong, Jiafeng Liu, Xiaobing Li, Feng Yu, Maosong Sun

Watch, Listen, and Describe: Globally and Locally Aligned Cross-Modal Attentions for Video Captioning

A major challenge for video captioning is to combine audio and visual cues. Existing multi-modal fusion methods have shown encouraging results in video understanding. However, the temporal structures of multiple modalities at different…

Computation and Language · Computer Science 2018-04-17 Xin Wang, Yuan-Fang Wang, William Yang Wang

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e. the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e. …

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-29 Yi-Chiao Wu, Israel D. Gebru, Dejan Marković, Alexander Richard

PCA Method for Automated Detection of Mispronounced Words

This paper presents a method for detecting mispronunciations with the aim of improving Computer Assisted Language Learning (CALL) tools used by foreign language learners. The algorithm is based on Principal Component Analysis (PCA). It is…

Sound · Computer Science 2016-02-29 Zhenhao Ge, Sudhendu R. Sharma, Mark J. T. Smith

CLAIR-A: Leveraging Large Language Models to Judge Audio Captions

The Automated Audio Captioning (AAC) task asks models to generate natural language descriptions of an audio input. Evaluating these machine-generated audio captions is a complex task that requires considering diverse factors, among them,…

Computation and Language · Computer Science 2025-08-12 Tsung-Han Wu, Joseph E. Gonzalez, Trevor Darrell, David M. Chan

pyAMPACT: A Score-Audio Alignment Toolkit for Performance Data Estimation and Multi-modal Processing

pyAMPACT (Python-based Automatic Music Performance Analysis and Comparison Toolkit) links symbolic and audio music representations to facilitate score-informed estimation of performance data in audio as well as general linking of symbolic…

Sound · Computer Science 2026-01-06 Johanna Devaney, Daniel McKemie, Alex Morgan

Investigations in Audio Captioning: Addressing Vocabulary Imbalance and Evaluating Suitability of Language-Centric Performance Metrics

The analysis, processing, and extraction of meaningful information from sounds all around us is the subject of the broader area of audio analytics. Audio captioning is a recent addition to the domain of audio analytics, a cross-modal…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-04 Sandeep Kothinti, Dimitra Emmanouilidou

Fast Audio Codec Identification Using Overlapping LCS

Audio data are widely exchanged over telecommunications networks. Due to the limitations of network resources, these data are typically compressed before transmission. Various methods are available for compressing audio data. To access such…

Multimedia · Computer Science 2025-02-12 Farzane Jafari

Improving the Performance of Automated Audio Captioning via Integrating the Acoustic and Semantic Information

Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic signal processing and natural language processing to generate human-readable sentences for audio clips. The current models are generally based on the…

Sound · Computer Science 2021-10-13 Zhongjie Ye, Helin Wang, Dongchao Yang, Yuexian Zou

SemanticAC: Semantics-Assisted Framework for Audio Classification

In this paper, we propose SemanticAC, a semantics-assisted framework for Audio Classification to better leverage the semantic information. Unlike conventional audio classification methods that treat class labels as discrete vectors, we…

Sound · Computer Science 2023-02-14 Yicheng Xiao, Yue Ma, Shuyan Li, Hantao Zhou, Ran Liao, Xiu Li

SUNAC: Source-aware Unified Neural Audio Codec

Neural audio codecs (NACs) provide compact representations that can be leveraged in many downstream applications, in particular large language models. Yet most NACs encode mixtures of multiple sources in an entangled manner, which may…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-21 Ryo Aihara, Yoshiki Masuyama, Francesco Paissan, François G. Germain, Gordon Wichern, Jonathan Le Roux

Standard audiogram classification from loudness scaling data using unsupervised, supervised, and explainable machine learning techniques

To address the calibration and procedural challenges inherent in remote audiogram assessment for rehabilitative audiology, this study investigated whether calibration-independent adaptive categorical loudness scaling (ACALOS) data can be…

Sound · Computer Science 2026-04-07 Chen Xu, Lena Schell-Majoor, Birger Kollmeier