Related papers: OpenACE: An Open Benchmark for Evaluating Audio Co…

AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec

A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e. the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e.…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-29 Yi-Chiao Wu, Israel D. Gebru, Dejan Marković, Alexander Richard

CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation

With the rise of multimodal large language models (LLMs), audio codecs play an increasingly vital role in encoding audio into discrete tokens, enabling integration of audio into text-based LLMs. Current audio codecs capture two types of…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-29 Ruifan Deng, Yitian Gong, Qinghui Gao, Luozhijie Jin, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu

ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric

Estimation of perceptual quality in audio and speech is possible using a variety of methods. The combined v3 release of ViSQOL and ViSQOLAudio (for speech and audio, respectively) provides improvements upon previous versions, in terms of…

Audio and Speech Processing · Electrical Eng. & Systems 2020-04-22 Michael Chinen, Felicia S. C. Lim, Jan Skoglund, Nikita Gureev, Feargus O'Gorman, Andrew Hines

ACES: Evaluating Automated Audio Captioning Models on the Semantics of Sounds

Automated Audio Captioning is a multimodal task that aims to convert audio content into natural language. The assessment of audio captioning systems is typically based on quantitative metrics applied to text data. Previous studies have…

Sound · Computer Science 2024-03-28 Gijs Wijngaard, Elia Formisano, Bruno L. Giordano, Michel Dumontier

Can Audio Captions Be Evaluated with Image Caption Metrics?

Automated audio captioning aims at generating textual descriptions for an audio clip. To evaluate the quality of generated audio captions, previous works directly adopt image captioning metrics like SPICE and CIDEr, without justifying their…

Sound · Computer Science 2022-01-28 Zelin Zhou, Zhiling Zhang, Xuenan Xu, Zeyu Xie, Mengyue Wu, Kenny Q. Zhu

Speech Quality Factors for Traditional and Neural-Based Low Bit Rate Vocoders

This study compares the performances of different algorithms for coding speech at low bit rates. In addition to widely deployed traditional vocoders, a selection of recently developed generative-model-based coders at different bit rates are…

Audio and Speech Processing · Electrical Eng. & Systems 2020-03-27 Wissam A. Jassim, Jan Skoglund, Michael Chinen, Andrew Hines

MACE: Leveraging Audio for Evaluating Audio Captioning Systems

The Automated Audio Captioning (AAC) task aims to describe an audio signal using natural language. To evaluate machine-generated captions, the metrics should take into account audio events, acoustic scenes, paralinguistics, signal…

Sound · Computer Science 2024-11-06 Satvik Dixit, Soham Deshmukh, Bhiksha Raj

AudioCapBench: Quick Evaluation on Audio Captioning across Sound, Music, and Speech

We introduce AudioCapBench, a benchmark for evaluating audio captioning capabilities of large multimodal models. AudioCapBench covers three distinct audio domains, including environmental sound, music, and speech, with 1,000 curated evaluation…

Sound · Computer Science 2026-03-02 Jielin Qiu, Jianguo Zhang, Zixiang Chen, Liangwei Yang, Ming Zhu, Juntao Tan, Haolin Chen, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang

ESPnet-Codec: Comprehensive Training and Evaluation of Neural Codecs for Audio, Music, and Speech

Neural codecs have become crucial to recent speech and audio generation research. In addition to signal compression capabilities, discrete codecs have also been found to enhance downstream training efficiency and compatibility with…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-25 Jiatong Shi, Jinchuan Tian, Yihan Wu, Jee-weon Jung, Jia Qi Yip, Yoshiki Masuyama, William Chen, Yuning Wu, Yuxun Tang, Massa Baali, Dareen Alharhi, Dong Zhang, Ruifan Deng, Tejes Srivastava, Haibin Wu, Alexander H. Liu, Bhiksha Raj, Qin Jin, Ruihua Song, Shinji Watanabe

ClonEval: An Open Voice Cloning Benchmark

We present a novel benchmark for voice cloning text-to-speech models. The benchmark consists of an evaluation protocol, an open-source library for assessing the performance of voice cloning models, and an accompanying leaderboard. The paper…

Computation and Language · Computer Science 2025-09-18 Iwona Christop, Tomasz Kuczyński, Marek Kubis

Speech Coding, Speech Interfaces and IoT - Opportunities and Challenges

Recent speech and audio coding standards such as 3GPP Enhanced Voice Services match the foreseeable needs and requirements in transmission of speech and audio, when using current transmission infrastructure and applications. Trends in…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-15 Tom Bäckström

Zero Resource Code-switched Speech Benchmark Using Speech Utterance Pairs For Multiple Spoken Languages

We introduce a new zero resource code-switched speech benchmark designed to directly assess the code-switching capabilities of self-supervised speech encoders. We showcase a baseline system of language modeling on discrete units to…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-19 Kuan-Po Huang, Chih-Kai Yang, Yu-Kuan Fu, Ewan Dunbar, Hung-yi Lee

Analysis of Speaker Verification Performance Trade-offs with Neural Audio Codec Transmission

Neural audio codecs (NACs) have made significant advancements in recent years and are rapidly being adopted in many audio processing pipelines. However, they can introduce audio distortions which degrade speaker verification (SV)…

Sound · Computer Science 2025-09-04 Nirmalya Mallick Thakur, Jia Qi Yip, Eng Siong Chng

BRACE: A Benchmark for Robust Audio Caption Quality Evaluation

Automatic audio captioning is essential for audio understanding, enabling applications such as accessibility and content indexing. However, evaluating the quality of audio captions remains a major challenge, especially in reference-free…

Sound · Computer Science 2025-12-12 Tianyu Guo, Hongyu Chen, Hao Liang, Meiyi Qiang, Bohan Zeng, Linzhuang Sun, Bin Cui, Wentao Zhang

SoundStream: An End-to-End Neural Audio Codec

We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully…

Sound · Computer Science 2021-07-08 Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi

SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-06 Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao

On The Effect Of Coding Artifacts On Acoustic Scene Classification

Previous DCASE challenges contributed to an increase in the performance of acoustic scene classification systems. State-of-the-art classifiers demand significant processing capabilities and memory, which is challenging for…

Audio and Speech Processing · Electrical Eng. & Systems 2021-12-10 Nagashree K. S. Rao, Nils Peters

OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder

Masked token prediction has emerged as a powerful pre-training objective across language, vision, and speech, offering the potential to unify these diverse modalities through a single pre-training task. However, its application for general…

Sound · Computer Science 2025-07-21 Shikhar Bharadwaj, Samuele Cornell, Kwanghee Choi, Satoru Fukayama, Hye-jin Shim, Soham Deshmukh, Shinji Watanabe

Assessing speech quality metrics for evaluation of neural audio codecs under clean speech conditions

Objective speech-quality metrics are widely used to assess codec performance. However, for neural codecs, it is often unclear which metrics provide reliable quality estimates. To address this, we evaluated 45 objective metrics by…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-30 Wolfgang Mack, Nezih Topaloglu, Laura Lechler, Ivana Balić, Alexandra Craciun, Mansur Yesilbursa, Kamil Wojcicki

X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance

We introduce X-ARES (eXtensive Audio Representation and Evaluation Suite), a novel open-source benchmark designed to systematically assess audio encoder performance across diverse domains. By encompassing tasks spanning speech,…

Sound · Computer Science 2025-05-28 Junbo Zhang, Heinrich Dinkel, Yadong Niu, Chenyu Liu, Si Cheng, Anbei Zhao, Jian Luan