Related papers: Enhancing Speech Recognition Decoding via Layer Ag…

Don't Be So Sure! Boosting ASR Decoding via Confidence Relaxation

Automatic Speech Recognition (ASR) systems frequently use a search-based decoding strategy aiming to find the best attainable transcript by considering multiple candidates. One prominent speech recognition decoding heuristic is beam search,…

Computation and Language · Computer Science 2022-12-29 Tomer Wullach, Shlomo E. Chazan

Selection of Layers from Self-supervised Learning Models for Predicting Mean-Opinion-Score of Speech

Self-supervised learning (SSL) models like Wav2Vec2, HuBERT, and WavLM have been widely used in speech processing. These transformer-based models consist of multiple layers, each capturing different levels of representation. While prior…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-13 Xinyu Liang, Fredrik Cumlin, Victor Ungureanu, Chandan K. A. Reddy, Christian Schuldt, Saikat Chatterjee

Robust Beam Search for Encoder-Decoder Attention Based Speech Recognition without Length Bias

As one popular modeling approach for end-to-end speech recognition, attention-based encoder-decoder models are known to suffer from length bias and the corresponding beam problem. Different approaches have been applied in simple beam search to…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-24 Wei Zhou, Ralf Schlüter, Hermann Ney

Investigating the 'Autoencoder Behavior' in Speech Self-Supervised Models: a focus on HuBERT's Pretraining

Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that finetuning all layers of the learned model leads to lower performance compared to resetting top layers. This phenomenon is attributed…

Computation and Language · Computer Science 2024-05-15 Valentin Vielzeuf

Layer-wise Analysis of a Self-supervised Speech Representation Model

Recently proposed self-supervised learning approaches have been successful for pre-training speech representation models. The utility of these learned representations has been observed empirically, but not much has been studied about the…

Computation and Language · Computer Science 2022-12-06 Ankita Pasad, Ju-Chieh Chou, Karen Livescu

Unified Hypersphere Embedding for Speaker Recognition

Incremental improvements in the accuracy of Convolutional Neural Networks are usually achieved through the use of deeper and more complex models trained on larger datasets. However, enlarging datasets and models increases the computation and storage…

Audio and Speech Processing · Electrical Eng. & Systems 2018-07-24 Mahdi Hajibabaei, Dengxin Dai

Augmenting BERT-style Models with Predictive Coding to Improve Discourse-level Representations

Current language models are usually trained using a self-supervised scheme, where the main focus is learning representations at the word or sentence level. However, there has been limited progress in generating useful discourse-level…

Computation and Language · Computer Science 2021-09-13 Vladimir Araujo, Andrés Villa, Marcelo Mendoza, Marie-Francine Moens, Alvaro Soto

Faster Speech-LLaMA Inference with Multi-token Prediction

Large language models (LLMs) have become proficient at solving a wide variety of tasks, including those involving multi-modal inputs. In particular, instantiating an LLM (such as LLaMA) with a speech encoder and training it on paired data…

Audio and Speech Processing · Electrical Eng. & Systems 2024-09-13 Desh Raj, Gil Keren, Junteng Jia, Jay Mahadeokar, Ozlem Kalinli

Enhancing Speaker Diarization with Large Language Models: A Contextual Beam Search Approach

Large language models (LLMs) have shown great promise for capturing contextual information in natural language processing tasks. We propose a novel approach to speaker diarization that incorporates the prowess of LLMs to exploit contextual…

Audio and Speech Processing · Electrical Eng. & Systems 2023-09-15 Tae Jin Park, Kunal Dhawan, Nithin Koluguri, Jagadeesh Balam

Deliberation Model Based Two-Pass End-to-End Speech Recognition

End-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) and perform competitively relative to conventional models. To further improve the quality, a two-pass model has been proposed to rescore streamed…

Audio and Speech Processing · Electrical Eng. & Systems 2020-03-19 Ke Hu, Tara N. Sainath, Ruoming Pang, Rohit Prabhavalkar

Investigating Multi-layer Representations for Dense Passage Retrieval

Dense retrieval models usually adopt vectors from the last hidden layer of the document encoder to represent a document, which is in contrast to the fact that representations in different layers of a pre-trained language model usually…

Information Retrieval · Computer Science 2025-09-30 Zhongbin Xie, Thomas Lukasiewicz

Probing Acoustic Representations for Phonetic Properties

Pre-trained acoustic representations such as wav2vec and DeCoAR have attained impressive word error rates (WER) for speech recognition benchmarks, particularly when labeled data is limited. But little is known about what phonetic properties…

Audio and Speech Processing · Electrical Eng. & Systems 2021-02-16 Danni Ma, Neville Ryant, Mark Liberman

Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference

Attention-based encoder-decoder models with autoregressive (AR) decoding have proven to be the dominant approach for automatic speech recognition (ASR) due to their superior accuracy. However, they often suffer from slow inference. This is…

Audio and Speech Processing · Electrical Eng. & Systems 2024-02-13 Masao Someki, Nicholas Eng, Yosuke Higuchi, Shinji Watanabe

Self-Supervised Learning for speech recognition with Intermediate layer supervision

Recent pioneering work finds that speech pre-trained models can solve full-stack speech processing tasks, because the model utilizes bottom layers to learn speaker-related information and top layers to encode content-related information.…

Audio and Speech Processing · Electrical Eng. & Systems 2021-12-17 Chengyi Wang, Yu Wu, Sanyuan Chen, Shujie Liu, Jinyu Li, Yao Qian, Zhenglu Yang

Improving Embedding Extraction for Speaker Verification with Ladder Network

Speaker verification is an established yet challenging task in speech processing and a very vibrant research area. Recent speaker verification (SV) systems rely on deep neural networks to extract high-level embeddings which are able to…

Audio and Speech Processing · Electrical Eng. & Systems 2020-03-23 Fei Tao, Gokhan Tur

Enhancing Neural Spoken Language Recognition: An Exploration with Multilingual Datasets

In this research, we advanced a spoken language recognition system, moving beyond traditional feature vector-based models. Our improvements focused on effectively capturing language characteristics over extended periods using a specialized…

Sound · Computer Science 2025-01-22 Or Haim Anidjar, Roi Yozevitch

A Deep Decoder Structure Based on WordEmbedding Regression for An Encoder-Decoder Based Model for Image Captioning

Generating textual descriptions for images has been an attractive problem for the computer vision and natural language processing researchers in recent years. Dozens of models based on deep learning have been proposed to solve this problem.…

Computer Vision and Pattern Recognition · Computer Science 2019-07-01 Ahmad Asadi, Reza Safabakhsh

Finding consensus in speech recognition: word error minimization and other applications of confusion networks

We describe a new framework for distilling information from word lattices to improve the accuracy of speech recognition and obtain a more perspicuous representation of a set of alternative hypotheses. In the standard MAP decoding approach…

Computation and Language · Computer Science 2022-02-28 L. Mangu, E. Brill, A. Stolcke

Decoding Imagined Speech using Wavelet Features and Deep Neural Networks

This paper proposes a novel approach that uses deep neural networks for classifying imagined speech, significantly increasing the classification accuracy. The proposed approach employs only the EEG channels over specific areas of the brain…

Neurons and Cognition · Quantitative Biology 2020-03-24 Jerrin Thomas Panachakel, A. G. Ramakrishnan

Scaling Up Deliberation for Multilingual ASR

Multilingual end-to-end automatic speech recognition models are attractive due to their simplicity in training and deployment. Recent work on large-scale training of such models has shown promising results compared to monolingual models.…

Computation and Language · Computer Science 2022-10-13 Ke Hu, Bo Li, Tara N. Sainath