Related papers: Fast Audio Codec Identification Using Overlapping …

Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-19 Edresson Casanova, Ryan Langman, Paarth Neekhara, Shehzeen Hussain, Jason Li, Subhankar Ghosh, Ante Jukić, Sang-gil Lee

Uniform Convergence Bounds for Codec Selection

We frame the problem of selecting an optimal audio encoding scheme as a supervised learning task. Through uniform convergence theory, we guarantee approximately optimal codec selection while controlling for selection bias. We present…

Sound · Computer Science 2018-12-20 Clayton Sanford, Cyrus Cousins, Eli Upfal

Towards audio language modeling -- an overview

Neural audio codecs are initially introduced to compress audio data into compact codes to reduce transmission latency. Researchers recently discovered the potential of codecs as suitable tokenizers for converting continuous audio into…

Audio and Speech Processing · Electrical Eng. & Systems 2024-02-21 Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kai-wei Chang, Ho-Lam Chung, Alexander H. Liu, Hung-yi Lee

Audio classification with Dilated Convolution with Learnable Spacings

Dilated convolution with learnable spacings (DCLS) is a recent convolution method in which the positions of the kernel elements are learned throughout training by backpropagation. Its interest has recently been demonstrated in computer…

Sound · Computer Science 2023-11-23 Ismail Khalfaoui-Hassani, Timothée Masquelier, Thomas Pellegrini

CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation

With the rise of multimodal large language models (LLMs), audio codec plays an increasingly vital role in encoding audio into discrete tokens, enabling integration of audio into text-based LLMs. Current audio codec captures two types of…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-29 Ruifan Deng, Yitian Gong, Qinghui Gao, Luozhijie Jin, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Xipeng Qiu

Shift-Invariance Sparse Coding for Audio Classification

Sparse coding is an unsupervised learning algorithm that learns a succinct high-level representation of the inputs given only unlabeled data; it represents each input as a sparse linear combination of a set of basis functions. Originally…

Machine Learning · Computer Science 2012-06-26 Roger Grosse, Rajat Raina, Helen Kwong, Andrew Y. Ng

Xampling: Compressed Sensing of Analog Signals

Xampling generalizes compressed sensing (CS) to reduced-rate sampling of analog signals. A unified framework is introduced for low rate sampling and processing of signals lying in a union of subspaces. Xampling consists of two main blocks:…

Information Theory · Computer Science 2015-03-19 Moshe Mishali, Yonina C. Eldar

Low-Complexity Acoustic Scene Classification Using Data Augmentation and Lightweight ResNet

We present a work on low-complexity acoustic scene classification (ASC) with multiple devices, namely the subtask A of Task 1 of the DCASE2021 challenge. This subtask focuses on classifying audio samples of multiple devices with a…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-06 Yanxiong Li, Wenchang Cao, Wei Xie, Qisheng Huang, Wenfeng Pang, Qianhua He

Towards Audio Token Compression in Large Audio Language Models

Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-27 Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness

Phonetic speech transcription is crucial for fine-grained linguistic analysis and downstream speech applications. While Connectionist Temporal Classification (CTC) is a widely used approach for such tasks due to its efficiency, it often…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-15 Zongli Ye, Jiachen Lian, Akshaj Gupta, Xuanru Zhou, Haodong Li, Krish Patel, Hwi Joo Park, Dingkun Zhou, Chenxu Guo, Shuhe Li, Sam Wang, Iris Zhou, Cheol Jun Cho, Zoe Ezzes, Jet M. J. Vonk, Brittany T. Morin, Rian Bogley, Lisa Wauters, Zachary A. Miller, Maria Luisa Gorno-Tempini, Gopala Anumanchipalli

A High Fidelity and Low Complexity Neural Audio Coding

Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor…

Sound · Computer Science 2023-10-18 Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu

A Compact and Discriminative Feature Based on Auditory Summary Statistics for Acoustic Scene Classification

One of the biggest challenges of acoustic scene classification (ASC) is to find proper features to better represent and characterize environmental sounds. Environmental sounds generally involve more sound sources while exhibiting less…

Sound · Computer Science 2019-04-11 Hongwei Song, Jiqing Han, Shiwen Deng

Toward a Sparse and Interpretable Audio Codec

Most widely-used modern audio codecs, such as Ogg Vorbis and MP3, as well as more recent "neural" codecs like Meta's Encodec or the Descript Audio Codec are based on block-coding; audio is divided into overlapping, fixed-size "frames" which…

Sound · Computer Science 2025-05-12 John Vinyard

Continuous speech separation: dataset and analysis

This paper describes a dataset and protocols for evaluating continuous speech separation algorithms. Most prior studies on speech separation use pre-segmented signals of artificially mixed speech utterances which are mostly fully…

Sound · Computer Science 2020-05-08 Zhuo Chen, Takuya Yoshioka, Liang Lu, Tianyan Zhou, Zhong Meng, Yi Luo, Jian Wu, Xiong Xiao, Jinyu Li

Unsupervised Feature Learning for Audio Analysis

Identifying acoustic events from a continuously streaming audio source is of interest for many applications including environmental monitoring for basic research. In this scenario neither different event classes are known nor what…

Computer Vision and Pattern Recognition · Computer Science 2017-12-12 Matthias Meyer, Jan Beutel, Lothar Thiele

Bird Species Classification And Acoustic Features Selection Based on Distributed Neural Network with Two Stage Windowing of Short-Term Features

Identification of bird species from audio records is one of the challenging tasks due to the existence of multiple species in the same recording, noise in the background, and long-term recording. Besides, choosing a proper acoustic feature…

Sound · Computer Science 2022-01-04 Nahian Ibn Hasan

Artificially Synthesising Data for Audio Classification and Segmentation to Improve Speech and Music Detection in Radio Broadcast

Segmenting audio into homogeneous sections such as music and speech helps us understand the content of audio. It is useful as a pre-processing step to index, store, and modify audio recordings, radio broadcasts and TV programmes. Deep…

Audio and Speech Processing · Electrical Eng. & Systems 2021-02-22 Satvik Venkatesh, David Moffat, Alexis Kirke, Gözel Shakeri, Stephen Brewster, Jörg Fachner, Helen Odell-Miller, Alex Street, Nicolas Farina, Sube Banerjee, Eduardo Reck Miranda

Capturing scattered discriminative information using a deep architecture in acoustic scene classification

Frequently misclassified pairs of classes that share many common acoustic properties exist in acoustic scene classification (ASC). To distinguish such pairs of classes, trivial details scattered throughout the data could be vital clues.…

Audio and Speech Processing · Electrical Eng. & Systems 2020-07-10 Hye-jin Shim, Jee-weon Jung, Ju-ho Kim, Ha-jin Yu

BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech

We introduce BANC, a neural binaural audio codec designed for efficient speech compression in single and two-speaker scenarios while preserving the spatial location information of each speaker. Our key contributions are as follows: 1) The…

Sound · Computer Science 2024-11-26 Anton Ratnarajah, Shi-Xiong Zhang, Dong Yu

CLC: Complex Linear Coding for the DNS 2020 Challenge

Complex-valued processing brought deep learning-based speech enhancement and signal extraction to a new level. Typically, the noise reduction process is based on a time-frequency (TF) mask which is applied to a noisy spectrogram. Complex…

Audio and Speech Processing · Electrical Eng. & Systems 2020-06-24 Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B., Andreas Maier