Related papers: Layer Reduction: Accelerating Conformer-Based Self…

Exploring Heterogeneous Characteristics of Layers in ASR Models for More Efficient Training

Transformer-based architectures have been the subject of research aimed at understanding their overparameterization and the non-uniform importance of their layers. Applying these approaches to Automatic Speech Recognition, we demonstrate…

Machine Learning · Computer Science 2022-02-07 Lillian Zhou , Dhruv Guliani , Andreas Kabel , Giovanni Motta , Françoise Beaufays

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer…

Computation and Language · Computer Science 2020-02-11 Zhenzhong Lan , Mingda Chen , Sebastian Goodman , Kevin Gimpel , Piyush Sharma , Radu Soricut

Self-Supervised Learning for speech recognition with Intermediate layer supervision

Recently, pioneer work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.…

Audio and Speech Processing · Electrical Eng. & Systems 2021-12-17 Chengyi Wang , Yu Wu , Sanyuan Chen , Shujie Liu , Jinyu Li , Yao Qian , Zhenglu Yang

Fine-tuning Strategies for Faster Inference using Speech Self-Supervised Models: A Comparative Study

Self-supervised learning (SSL) has allowed substantial progress in Automatic Speech Recognition (ASR) performance in low-resource settings. In this context, it has been demonstrated that larger self-supervised feature extractors are crucial…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-14 Salah Zaiem , Robin Algayres , Titouan Parcollet , Slim Essid , Mirco Ravanelli

Accelerating Training of Transformer-Based Language Models with Progressive Layer Dropping

Recently, Transformer-based language models have demonstrated remarkable performance across many NLP domains. However, the unsupervised pre-training step of these models suffers from unbearable overall computational expenses. Current…

Machine Learning · Computer Science 2020-10-27 Minjia Zhang , Yuxiong He

Parameter-Efficient Transformer Embeddings

Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which…

Computation and Language · Computer Science 2025-05-06 Henry Ndubuaku , Mouad Talhi

Exploring Efficient-tuning Methods in Self-supervised Speech Models

In this study, we aim to explore efficient tuning methods for speech self-supervised learning. Recent studies show that self-supervised learning (SSL) can learn powerful representations for different speech tasks. However, fine-tuning…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-31 Zih-Ching Chen , Chin-Lun Fu , Chih-Ying Liu , Shang-Wen Li , Hung-yi Lee

LASER: Learning by Aligning Self-supervised Representations of Speech for Improving Content-related Tasks

Self-supervised learning (SSL)-based speech models are extensively used for full-stack speech processing. However, it has been observed that improving SSL-based speech representations using unlabeled speech for content-related tasks is…

Computation and Language · Computer Science 2024-06-14 Amit Meghanani , Thomas Hain

How Redundant Is the Transformer Stack in Speech Representation Models?

Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-01-20 Teresa Dorszewski , Albert Kjøller Jacobsen , Lenka Tětková , Lars Kai Hansen

Bridging the Granularity Gap for Acoustic Modeling

While Transformer has become the de-facto standard for speech, modeling upon the fine-grained frame-level features remains an open challenge of capturing long-distance dependencies and distributing the attention weights. We propose…

Computation and Language · Computer Science 2023-05-30 Chen Xu , Yuhao Zhang , Chengbo Jiao , Xiaoqian Liu , Chi Hu , Xin Zeng , Tong Xiao , Anxiang Ma , Huizhen Wang , JingBo Zhu

Emergent Properties of Finetuned Language Representation Models

Large, self-supervised transformer-based language representation models have recently received significant amounts of attention, and have produced state-of-the-art results across a variety of tasks simply by scaling up pre-training on…

Computation and Language · Computer Science 2019-10-25 Alexandre Matton , Luke de Oliveira

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing…

Sound · Computer Science 2024-02-09 Sungho Jeon , Ching-Feng Yeh , Hakan Inan , Wei-Ning Hsu , Rashi Rungta , Yashar Mehdad , Daniel Bikel

Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models

Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and…

Machine Learning · Computer Science 2026-05-06 Chen Liu , Xingzhi Sun , Xi Xiao , Alexandre Van Tassel , Ke Xu , Kristof Reimann , Danqi Liao , Mark Gerstein , Tianyang Wang , Xiao Wang , Smita Krishnaswamy

CHAPTER: Exploiting Convolutional Neural Network Adapters for Self-supervised Speech Models

Self-supervised learning (SSL) is a powerful technique for learning representations from unlabeled data. Transformer based models such as HuBERT, which consist a feature extractor and transformer layers, are leading the field in the speech…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-23 Zih-Ching Chen , Yu-Shun Sung , Hung-yi Lee

Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning with Self-Knowledge Distillation

End-to-end automatic speech recognition (ASR), unlike conventional ASR, does not have modules to learn the semantic representation from speech encoder. Moreover, the higher frame-rate of speech representation prevents the model to learn the…

Artificial Intelligence · Computer Science 2021-03-19 Md Akmal Haidar , Chao Xing , Mehdi Rezagholizadeh

Incremental Layer-wise Self-Supervised Learning for Efficient Speech Domain Adaptation On Device

Streaming end-to-end speech recognition models have been widely applied to mobile devices and show significant improvement in efficiency. These models are typically trained on the server using transcribed speech data. However, the server…

Sound · Computer Science 2021-10-04 Zhouyuan Huo , Dongseong Hwang , Khe Chai Sim , Shefali Garg , Ananya Misra , Nikhil Siddhartha , Trevor Strohman , Françoise Beaufays

Progressive Multi-Scale Self-Supervised Learning for Speech Recognition

Self-supervised learning (SSL) models have achieved considerable improvements in automatic speech recognition (ASR). In addition, ASR performance could be further improved if the model is dedicated to audio content information learning…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-08 Genshun Wan , Tan Liu , Hang Chen , Jia Pan , Cong Liu , Zhongfu Ye

Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion

Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream…

Machine Learning · Computer Science 2022-11-08 Zhouyuan Huo , Khe Chai Sim , Bo Li , Dongseong Hwang , Tara N. Sainath , Trevor Strohman

Ultra Fast Speech Separation Model with Teacher Student Learning

Transformer has been successfully applied to speech separation recently with its strong long-dependency modeling capacity using a self-attention mechanism. However, Transformer tends to have heavy run-time costs due to the deep encoder…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-28 Sanyuan Chen , Yu Wu , Zhuo Chen , Jian Wu , Takuya Yoshioka , Shujie Liu , Jinyu Li , Xiangzhan Yu

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-19 Anmol Gulati , James Qin , Chung-Cheng Chiu , Niki Parmar , Yu Zhang , Jiahui Yu , Wei Han , Shibo Wang , Zhengdong Zhang , Yonghui Wu , Ruoming Pang