Related papers: SpatialCodec: Neural Spatial Speech Coding

Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array

Spatial audio formats like Ambisonics are playback device layout-agnostic and well-suited for applications such as teleconferencing and virtual reality. Conventional Ambisonic encoding methods often rely on spherical microphone arrays for…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-17 Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu

Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding

Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods utilize Residual Vector Quantization (RVQ) to encode speech into…

Sound · Computer Science 2024-10-22 Peiji Yang, Fengping Wang, Yicheng Zhong, Huawei Wei, Zhisheng Wang

SuperCodec: A Neural Speech Codec with Selective Back-Projection Network

Neural speech coding is a rapidly developing topic, where state-of-the-art approaches now exhibit superior compression performance than conventional methods. Despite significant progress, existing methods still have limitations in…

Sound · Computer Science 2024-07-31 Youqiang Zheng, Weiping Tu, Li Xiao, Xinmeng Xu

BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction

Neural audio codecs have recently enabled high-fidelity reconstruction at high compression rates, especially for speech. However, speech and non-speech audio exhibit fundamentally different spectral characteristics: speech energy…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-11 Haoran Wang, Jiatong Shi, Jinchuan Tian, Bohan Li, Kai Yu, Shinji Watanabe

Self-Supervised Learning of Spatial Acoustic Representation with Cross-Channel Signal Reconstruction and Multi-Channel Conformer

Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality…

Sound · Computer Science 2024-09-10 Bing Yang, Xiaofei Li

Incorporating Spatial Cues in Modular Speaker Diarization for Multi-channel Multi-party Meetings

Although fully end-to-end speaker diarization systems have made significant progress in recent years, modular systems often achieve superior results in real-world scenarios due to their greater adaptability and robustness. Historically,…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-26 Ruoyu Wang, Shutong Niu, Gaobin Yang, Jun Du, Shuangqing Qian, Tian Gao, Jia Pan

Speech Separation using Neural Audio Codecs with Embedding Loss

Neural audio codecs have revolutionized audio processing by enabling speech tasks to be performed on highly compressed representations. Recent work has shown that speech separation can be achieved within these compressed domains, offering…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-28 Jia Qi Yip, Chin Yuen Kwok, Bin Ma, Eng Siong Chng

Efficient Multi-Channel Speech Enhancement with Spherical Harmonics Injection for Directional Encoding

Multi-channel speech enhancement extracts speech using multiple microphones that capture spatial cues. Effectively utilizing directional information is key for multi-channel enhancement. Deep learning shows great potential on multi-channel…

Sound · Computer Science 2023-09-21 Jiahui Pan, Pengjie Shen, Hui Zhang, Xueliang Zhang

SecoustiCodec: Cross-Modal Aligned Streaming Single-Codecbook Speech Codec

Speech codecs serve as a crucial bridge in unifying speech and text language models. Existing codec methods face several challenges in semantic encoding, such as residual paralinguistic information (e.g., timbre, emotion), insufficient…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-06 Chunyu Qiang, Haoyu Wang, Cheng Gong, Tianrui Wang, Ruibo Fu, Tao Wang, Ruilong Chen, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Longbiao Wang, Jianwu Dang, Jianhua Tao

A Neural Speech Codec for Noise Robust Speech Coding

This paper considers the joint compression and enhancement problem for speech signal in the presence of noise. Recently, the SoundStream codec, which relies on end-to-end joint training of an encoder-decoder pair and a residual vector…

Sound · Computer Science 2025-09-03 Jiayi Huang, Zeyu Yan, Wenbin Jiang, He Wang, Fei Wen

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous…

Machine Learning · Computer Science 2025-10-28 Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli

DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation

Neural audio codecs form the foundational building blocks for language model (LM)-based speech generation. Typically, there is a trade-off between frame rate and audio quality. This study introduces a low-frame-rate, semantically enhanced…

Sound · Computer Science 2025-10-02 Jiaqi Li, Xiaolong Lin, Zhekai Li, Shixi Huang, Yuancheng Wang, Chaoren Wang, Zhenpeng Zhan, Zhizheng Wu

A High Fidelity and Low Complexity Neural Audio Coding

Audio coding is an essential module in the real-time communication system. Neural audio codecs can compress audio samples with a low bitrate due to the strong modeling and generative capabilities of deep neural networks. To address the poor…

Sound · Computer Science 2023-10-18 Wenzhe Liu, Wei Xiao, Meng Wang, Shan Yang, Yupeng Shi, Yuyong Kang, Dan Su, Shidong Shang, Dong Yu

SACodec: Asymmetric Quantization with Semantic Anchoring for Low-Bitrate High-Fidelity Neural Speech Codecs

Neural Speech Codecs face a fundamental trade-off at low bitrates: preserving acoustic fidelity often compromises semantic richness. To address this, we introduce SACodec, a novel codec built upon an asymmetric dual-quantizer that employs…

Sound · Computer Science 2025-12-25 Zhongren Dong, Bin Wang, Jing Han, Haotian Guo, Xiaojun Mo, Yimin Cao, Zixing Zhang

Spatial Voice Conversion: Voice Conversion Preserving Spatial Information and Non-target Signals

This paper proposes a new task called spatial voice conversion, which aims to convert a target voice while preserving spatial information and non-target signals. Traditional voice conversion methods focus on single-channel waveforms,…

Sound · Computer Science 2024-06-26 Kentaro Seki, Shinnosuke Takamichi, Norihiro Takamune, Yuki Saito, Kanami Imamura, Hiroshi Saruwatari

BANC: Towards Efficient Binaural Audio Neural Codec for Overlapping Speech

We introduce BANC, a neural binaural audio codec designed for efficient speech compression in single and two-speaker scenarios while preserving the spatial location information of each speaker. Our key contributions are as follows: 1) The…

Sound · Computer Science 2024-11-26 Anton Ratnarajah, Shi-Xiong Zhang, Dong Yu

SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound

Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modelling techniques to audio data. However, traditional codecs…

Sound · Computer Science 2024-12-02 Haohe Liu, Xuenan Xu, Yi Yuan, Mengyue Wu, Wenwu Wang, Mark D. Plumbley

DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners

Universal audio codecs learn entangled representations across audio types, whereas some specific codecs offer decoupled representations but are limited to speech. Real-world audio, however, often contains mixed speech and background sounds,…

Sound · Computer Science 2025-09-12 Xiaoxue Luo, Jinwei Huang, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

Quantifying Spatial Audio Quality Impairment

Spatial audio quality is a highly multifaceted concept, with many interactions between environmental, geometrical, anatomical, psychological, and contextual considerations. Methods for characterization or evaluation of the geometrical…

Audio and Speech Processing · Electrical Eng. & Systems 2024-08-27 Karn N. Watcharasupat, Alexander Lerch

Bringing Interpretability to Neural Audio Codecs

The advent of neural audio codecs has increased in popularity due to their potential for efficiently modeling audio with transformers. Such advanced codecs represent audio from a highly continuous waveform to low-sampled discrete units. In…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-19 Samir Sadok, Julien Hauret, Éric Bavu