Related papers: Linear-Complexity Self-Supervised Learning for Spe…

An Analysis of Linear Complexity Attention Substitutes with BEST-RQ

Self-Supervised Learning (SSL) has proven to be effective in various domains, including speech processing. However, SSL is computationally and memory expensive. This is in part due to the quadratic complexity of multi-head self-attention…

Machine Learning · Computer Science 2024-09-05 Ryan Whetten , Titouan Parcollet , Adel Moumen , Marco Dinarelli , Yannick Estève

Linear Time Complexity Conformers with SummaryMixing for Streaming Speech Recognition

Automatic speech recognition (ASR) with an encoder equipped with self-attention, whether streaming or non-streaming, takes quadratic time in the length of the speech utterance. This slows down training and decoding, increases their cost, and…

Sound · Computer Science 2024-09-12 Titouan Parcollet , Rogier van Dalen , Shucong Zhang , Sourav Bhattacharya

Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition

Self-supervised learning (SSL) has advanced speech processing but suffers from quadratic complexity due to self-attention. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes entire utterances…

Audio and Speech Processing · Electrical Eng. & Systems 2026-02-11 Aditya Srinivas Menon , Kumud Tripathi , Raj Gohil , Pankaj Wasnik

Match to Win: Analysing Sequences Lengths for Efficient Self-supervised Learning in Speech and Audio

Self-supervised learning (SSL) has proven vital in speech and audio-related applications. The paradigm trains a general model on unlabeled data that can later be used to solve specific downstream tasks. This type of model is costly to train…

Sound · Computer Science 2022-11-23 Yan Gao , Javier Fernandez-Marques , Titouan Parcollet , Pedro P. B. de Gusmao , Nicholas D. Lane

SummaryMixing: A Linear-Complexity Alternative to Self-Attention for Speech Recognition and Understanding

Modern speech processing systems rely on self-attention. Unfortunately, token mixing with self-attention takes quadratic time in the length of the speech utterance, slowing down inference and training and increasing memory consumption.…

Computation and Language · Computer Science 2024-07-12 Titouan Parcollet , Rogier van Dalen , Shucong Zhang , Sourav Bhattacharya

Progressive Multi-Scale Self-Supervised Learning for Speech Recognition

Self-supervised learning (SSL) models have achieved considerable improvements in automatic speech recognition (ASR). In addition, ASR performance could be further improved if the model is dedicated to audio content information learning…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-08 Genshun Wan , Tan Liu , Hang Chen , Jia Pan , Cong Liu , Zhongfu Ye

Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning

Multilingual self-supervised learning (SSL) has often lagged behind state-of-the-art (SOTA) methods due to the expenses and complexity required to handle many languages. This further harms the reproducibility of SSL, which is already…

Computation and Language · Computer Science 2023-09-29 William Chen , Jiatong Shi , Brian Yan , Dan Berrebbi , Wangyou Zhang , Yifan Peng , Xuankai Chang , Soumi Maiti , Shinji Watanabe

A Pre-training Framework that Encodes Noise Information for Speech Quality Assessment

Self-supervised learning (SSL) has grown in interest within the speech processing community, since it produces representations that are useful for many downstream tasks. SSL uses global and contextual methods to produce robust…

Audio and Speech Processing · Electrical Eng. & Systems 2024-11-08 Subrina Sultana , Donald S. Williamson

Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-05 Po-chun Hsu , Ali Elkahky , Wei-Ning Hsu , Yossi Adi , Tu Anh Nguyen , Jade Copet , Emmanuel Dupoux , Hung-yi Lee , Abdelrahman Mohamed

LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks

Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is…

Computation and Language · Computer Science 2024-06-14 Amit Meghanani , Thomas Hain

Realizing Petabyte Scale Acoustic Modeling

Large scale machine learning (ML) systems such as the Alexa automatic speech recognition (ASR) system continue to improve with increasing amounts of manually transcribed training data. Instead of scaling manual transcription to impractical…

Sound · Computer Science 2019-04-25 Sree Hari Krishnan Parthasarathi , Nitin Sivakrishnan , Pranav Ladkat , Nikko Strom

GMSA: Enhancing Context Compression via Group Merging and Layer Semantic Alignment

Large Language Models (LLMs) have achieved remarkable performance across a wide range of Natural Language Processing (NLP) tasks. However, in long-context scenarios, they face two challenges: high computational cost and information…

Computation and Language · Computer Science 2026-02-10 Jiwei Tang , Zhicheng Zhang , Shunlong Wu , Jingheng Ye , Lichen Bai , Zitai Wang , Tingwei Lu , Lin Hai , Yiming Zhao , Hai-Tao Zheng , Hong-Gee Kim

Evaluating Self-Supervised Speech Models via Text-Based LLMS

Self-Supervised Learning (SSL) has gained traction for its ability to learn rich representations with low labeling costs, applicable across diverse downstream tasks. However, assessing the downstream-task performance remains challenging due…

Sound · Computer Science 2025-10-07 Takashi Maekaku , Keita Goto , Jinchuan Tian , Yusuke Shinohara , Shinji Watanabe

Towards Early Prediction of Self-Supervised Speech Model Performance

In Self-Supervised Learning (SSL), pre-training and evaluation are resource intensive. In the speech domain, current indicators of the quality of SSL models during pre-training, such as the loss, do not correlate well with downstream…

Sound · Computer Science 2025-06-03 Ryan Whetten , Lucas Maison , Titouan Parcollet , Marco Dinarelli , Yannick Estève

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Self-supervised learning (SSL) achieves great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity,…

Computation and Language · Computer Science 2022-11-23 Sanyuan Chen , Chengyi Wang , Zhengyang Chen , Yu Wu , Shujie Liu , Zhuo Chen , Jinyu Li , Naoyuki Kanda , Takuya Yoshioka , Xiong Xiao , Jian Wu , Long Zhou , Shuo Ren , Yanmin Qian , Yao Qian , Jian Wu , Michael Zeng , Xiangzhan Yu , Furu Wei

Simultaneous or Sequential Training? How Speech Representations Cooperate in a Multi-Task Self-Supervised Learning System

Speech representation learning with self-supervised algorithms has resulted in notable performance boosts in many downstream tasks. Recent work combined self-supervised learning (SSL) and visually grounded speech (VGS) processing mechanisms…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-08 Khazar Khorrami , María Andrea Cruz Blandón , Tuomas Virtanen , Okko Räsänen

Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning

Recent years have witnessed significant advancements in self-supervised learning (SSL) methods for speech-processing tasks. Various speech-based SSL models have been developed and present promising performance on a range of downstream tasks…

Computation and Language · Computer Science 2023-10-02 Guanrou Yang , Ziyang Ma , Zhisheng Zheng , Yakun Song , Zhikang Niu , Xie Chen

MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention…

Computation and Language · Computer Science 2026-03-03 MiniCPM Team , Wenhao An , Yingfa Chen , Yewei Fang , Jiayi Li , Xin Li , Yaohui Li , Yishan Li , Yuxuan Li , Biyuan Lin , Chuan Liu , Hezi Liu , Siyuan Liu , Hongya Lyu , Yinxu Pan , Shixin Ren , Xingyu Shen , Zhou Su , Haojun Sun , Yangang Sun , Zhen Leng Thai , Xin Tian , Rui Wang , Xiaorong Wang , Yudong Wang , Bo Wu , Xiaoyue Xu , Dong Xu , Shuaikang Xue , Jiawei Yang , Bowen Zhang , Jinqian Zhang , Letian Zhang , Shengnan Zhang , Xinyu Zhang , Xinyuan Zhang , Zhu Zhang , Hengyu Zhao , Jiacheng Zhao , Zhi Zheng , Jie Zhou , Zihan Zhou , Shuo Wang , Chaojun Xiao , Xu Han , Zhiyuan Liu , Maosong Sun

An Efficient Self-Supervised Cross-View Training For Sentence Embedding

Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a…

Computation and Language · Computer Science 2023-11-07 Peerat Limkonchotiwat , Wuttikorn Ponwitayarat , Lalita Lowphansirikul , Can Udomcharoenchaikit , Ekapol Chuangsuwanich , Sarana Nutanong

Scaling Language-Free Visual Representation Learning

Visual Self-Supervised Learning (SSL) currently underperforms Contrastive Language-Image Pretraining (CLIP) in multimodal settings such as Visual Question Answering (VQA). This multimodal gap is often attributed to the semantics introduced…

Computer Vision and Pattern Recognition · Computer Science 2025-04-02 David Fan , Shengbang Tong , Jiachen Zhu , Koustuv Sinha , Zhuang Liu , Xinlei Chen , Michael Rabbat , Nicolas Ballas , Yann LeCun , Amir Bar , Saining Xie