Related papers: Codec2Vec: Self-Supervised Speech Representation L…

Speech Separation using Neural Audio Codecs with Embedding Loss

Neural audio codecs have revolutionized audio processing by enabling speech tasks to be performed on highly compressed representations. Recent work has shown that speech separation can be achieved within these compressed domains, offering…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-11-28 · Jia Qi Yip, Chin Yuen Kwok, Bin Ma, Eng Siong Chng

AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations

Self-supervision has shown great potential for audio-visual speech recognition by vastly reducing the amount of labeled data required to build good systems. However, existing methods are either not entirely end-to-end or do not train joint…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-01-23 · Jiachen Lian, Alexei Baevski, Wei-Ning Hsu, Michael Auli

Codec-ASR: Training Performant Automatic Speech Recognition Systems with Discrete Speech Representations

Discrete speech representations have garnered recent attention for their efficacy in training transformer-based models for various speech-related tasks such as automatic speech recognition (ASR), translation, speaker verification, and joint…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-09-26 · Kunal Dhawan, Nithin Rao Koluguri, Ante Jukić, Ryan Langman, Jagadeesh Balam, Boris Ginsburg

Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes…

Machine Learning · Computer Science · 2023-06-16 · Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli

Wav2vec-C: A Self-supervised Model for Speech Representation Learning

Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-06-25 · Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas

Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks

Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have…

Sound · Computer Science · 2022-01-10 · Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, Yatharth Saraf

Improving Speech Decoding from ECoG with Self-Supervised Pretraining

Recent work on intracranial brain-machine interfaces has demonstrated that spoken speech can be decoded with high accuracy, essentially by treating the problem as an instance of supervised learning and training deep neural networks to map…

Neurons and Cognition · Quantitative Biology · 2024-05-30 · Brian A. Yuan, Joseph G. Makin

ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers

Neural speech codecs aim to compress input signals into minimal bits while maintaining content quality in a low-latency manner. However, existing neural codecs often trade model complexity for reconstruction performance. These codecs…

Sound · Computer Science · 2024-10-04 · Yuzhe Gu, Enmao Diao

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised…

Machine Learning · Computer Science · 2022-10-27 · Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, Michael Auli

Bringing Interpretability to Neural Audio Codecs

The advent of neural audio codecs has increased in popularity due to their potential for efficiently modeling audio with transformers. Such advanced codecs represent audio from a highly continuous waveform to low-sampled discrete units. In…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-09-19 · Samir Sadok, Julien Hauret, Éric Bavu

Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations

Current large speech language models are mainly based on semantic tokens from discretization of self-supervised learned representations and acoustic tokens from a neural codec, following a semantic-modeling and acoustic-synthesis paradigm.…

Sound · Computer Science · 2025-10-16 · Xue Jiang, Xiulian Peng, Yuan Zhang, Yan Lu

Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding

Neural speech codecs have demonstrated their ability to compress high-quality speech and audio by converting them into discrete token representations. Most existing methods utilize Residual Vector Quantization (RVQ) to encode speech into…

Sound · Computer Science · 2024-10-22 · Peiji Yang, Fengping Wang, Yicheng Zhong, Huawei Wei, Zhisheng Wang

DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners

Universal audio codecs learn entangled representations across audio types, whereas some specific codecs offer decoupled representations but are limited to speech. Real-world audio, however, often contains mixed speech and background sounds,…

Sound · Computer Science · 2025-09-12 · Xiaoxue Luo, Jinwei Huang, Runyan Yang, Yingying Gao, Junlan Feng, Chao Deng, Shilei Zhang

Representation Learning With Hidden Unit Clustering For Low Resource Speech Applications

The representation learning of speech, without textual resources, is an area of significant interest for many low resource speech applications. In this paper, we describe an approach to self-supervised representation learning from raw audio…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-07-17 · Varun Krishna, Tarun Sai, Sriram Ganapathy

Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

In this paper, we propose a novel deep neural network architecture, Speech2Vec, for learning fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain semantic information pertaining to…

Computation and Language · Computer Science · 2018-06-12 · Yu-An Chung, James Glass

RepCodec: A Speech Representation Codec for Speech Tokenization

With recent rapid growth of large language models (LLMs), discrete speech tokenization has played an important role for injecting speech into LLMs. However, this discretization gives rise to a loss of information, consequently impairing…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-07-23 · Zhichao Huang, Chutong Meng, Tom Ko

CoBERT: Self-Supervised Speech Representation Learning Through Code Representation Learning

Speech is the surface form of a finite set of phonetic units, which can be represented by discrete codes. We propose the Code BERT (CoBERT) approach for self-supervised speech representation learning. The idea is to convert an utterance to…

Sound · Computer Science · 2023-07-06 · Chutong Meng, Junyi Ao, Tom Ko, Mingxuan Wang, Haizhou Li

Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

Neural audio codec models are becoming increasingly important as they serve as tokenizers for audio, enabling efficient transmission or facilitating speech language modeling. The ideal neural audio codec should maintain content,…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-09-24 · Haibin Wu, Xuanjun Chen, Yi-Cheng Lin, Kaiwei Chang, Jiawei Du, Ke-Han Lu, Alexander H. Liu, Ho-Lam Chung, Yuan-Kuei Wu, Dongchao Yang, Songxiang Liu, Yi-Chiao Wu, Xu Tan, James Glass, Shinji Watanabe, Hung-yi Lee

Learning Word Embeddings from Speech

In this paper, we propose a novel deep neural network architecture, Sequence-to-Sequence Audio2Vec, for unsupervised learning of fixed-length vector representations of audio segments excised from a speech corpus, where the vectors contain…

Computation and Language · Computer Science · 2017-11-07 · Yu-An Chung, James Glass

EnCodecMAE: Leveraging neural codecs for universal audio representation learning

The goal of universal audio representation learning is to obtain foundational models that can be used for a variety of downstream tasks involving speech, music and environmental sounds. To approach this problem, methods inspired by works on…

Sound · Computer Science · 2024-05-22 · Leonardo Pepino, Pablo Riera, Luciana Ferrer