Related papers: Text-Utilization for Encoder-dominated Speech Reco…

A Comparison of Techniques for Language Model Integration in Encoder-Decoder Speech Recognition

Attention-based recurrent neural encoder-decoder models present an elegant solution to the automatic speech recognition problem. This approach folds the acoustic model, pronunciation model, and language model into a single network and…

Audio and Speech Processing · Electrical Eng. & Systems 2018-11-08 Shubham Toshniwal , Anjuli Kannan , Chung-Cheng Chiu , Yonghui Wu , Tara N Sainath , Karen Livescu

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

Collecting audio-text pairs is expensive; however, it is much easier to access text-only data. Unless using shallow fusion, end-to-end automatic speech recognition (ASR) models require architecture modifications or additional training…

Audio and Speech Processing · Electrical Eng. & Systems 2024-01-10 Emiru Tsunoo , Hayato Futami , Yosuke Kashiwagi , Siddhant Arora , Shinji Watanabe

Few-Shot Spoken Language Understanding via Joint Speech-Text Models

Recent work on speech representation models jointly pre-trained with text has demonstrated the potential of improving speech representations by encoding speech and text in a shared space. In this paper, we leverage such shared…

Computation and Language · Computer Science 2023-10-10 Chung-Ming Chien , Mingjiamei Zhang , Ju-Chieh Chou , Karen Livescu

On decoder-only architecture for speech-to-text and large language model integration

Large language models (LLMs) have achieved remarkable success in the field of natural language processing, enabling better human-computer interaction using natural language. However, the seamless integration of speech signals into LLMs has…

Audio and Speech Processing · Electrical Eng. & Systems 2023-10-03 Jian Wu , Yashesh Gaur , Zhuo Chen , Long Zhou , Yimeng Zhu , Tianrui Wang , Jinyu Li , Shujie Liu , Bo Ren , Linquan Liu , Yu Wu

Improving End-to-end Speech Translation by Leveraging Auxiliary Speech and Text Data

We present a method for introducing a text encoder into pre-trained end-to-end speech translation systems. It enhances the ability of adapting one modality (i.e., source-language speech) to another (i.e., source-language text). Thus, the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-06 Yuhao Zhang , Chen Xu , Bojie Hu , Chunliang Zhang , Tong Xiao , Jingbo Zhu

Collaborative Training of Acoustic Encoders for Speech Recognition

On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the…

Computation and Language · Computer Science 2021-07-15 Varun Nagaraja , Yangyang Shi , Ganesh Venkatesh , Ozlem Kalinli , Michael L. Seltzer , Vikas Chandra

Condenser: a Pre-training Architecture for Dense Retrieval

Pre-trained Transformer language models (LM) have become go-to text representation encoders. Prior research fine-tunes deep LMs to encode text sequences such as sentences and passages into single dense vector representations for efficient…

Computation and Language · Computer Science 2021-09-22 Luyu Gao , Jamie Callan

Low-Resource Speech-to-Text Translation

Speech-to-text translation has many potential applications for low-resource languages, but the typical approach of cascading speech recognition with machine translation is often impossible, since the transcripts needed to train a speech…

Computation and Language · Computer Science 2018-06-19 Sameer Bansal , Herman Kamper , Karen Livescu , Adam Lopez , Sharon Goldwater

Less is More: Pre-train a Strong Text Encoder for Dense Retrieval Using a Weak Decoder

Dense retrieval requires high-quality text sequence embeddings to support effective search in the representation space. Autoencoder-based language models are appealing in dense retrieval as they train the encoder to output high-quality…

Machine Learning · Computer Science 2021-09-17 Shuqi Lu , Di He , Chenyan Xiong , Guolin Ke , Waleed Malik , Zhicheng Dou , Paul Bennett , Tieyan Liu , Arnold Overwijk

A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

This paper presents a simple method that allows to easily enhance textual pre-trained large language models with speech information, when fine-tuned for a specific classification task. A classical issue with the fusion of many embeddings…

Computation and Language · Computer Science 2026-04-07 Nicolas Calbucura , Jose Guillen , Valentin Barriere

TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment

Recent advances in speech-enabled language models have shown promising results in building intelligent voice assistants. However, most existing approaches rely on large-scale paired speech-text data and extensive computational resources,…

Computation and Language · Computer Science 2025-06-10 Taesoo Kim , Jong Hwan Ko

An Experimental Study of LSTM Encoder-Decoder Model for Text Simplification

Text simplification (TS) aims to reduce the lexical and structural complexity of a text, while still retaining the semantic meaning. Current automatic TS techniques are limited to either lexical-level applications or manually defining a…

Computation and Language · Computer Science 2016-09-14 Tong Wang , Ping Chen , Kevin Amaral , Jipeng Qiang

Improving Neural Text Simplification Model with Simplified Corpora

Text simplification (TS) can be viewed as monolingual translation task, translating between text variations within a single language. Recent neural TS models draw on insights from neural machine translation to learn lexical simplification…

Computation and Language · Computer Science 2018-10-11 Jipeng Qiang

Unveiling the Role of Pretraining in Direct Speech Translation

Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists on pretraining the encoder on automatic speech recognition, hence losing efficiency in the training process. In this…

Computation and Language · Computer Science 2024-09-27 Belen Alastruey , Gerard I. Gállego , Marta R. Costa-jussà

Evaluating Small Decoder-Only Language Models for Grammar Correction and Text Simplification

Large language models have become extremely popular recently due to their ability to achieve strong performance on a variety of tasks, such as text generation and rewriting, but their size and computation cost make them difficult to access,…

Computation and Language · Computer Science 2026-01-08 Anthony Lamelas

An Encoder-Decoder Based Audio Captioning System With Transfer and Reinforcement Learning

Automated audio captioning aims to use natural language to describe the content of audio data. This paper presents an audio captioning system with an encoder-decoder architecture, where the decoder predicts words based on audio features…

Audio and Speech Processing · Electrical Eng. & Systems 2021-08-06 Xinhao Mei , Qiushi Huang , Xubo Liu , Gengyun Chen , Jingqian Wu , Yusong Wu , Jinzheng Zhao , Shengchen Li , Tom Ko , H Lilian Tang , Xi Shao , Mark D. Plumbley , Wenwu Wang

Decoding Partial Differential Equations: Cross-Modal Adaptation of Decoder-only Models to PDEs

While large language models are primarily used on natural language tasks, they have also shown great promise when adapted to new modalities, e.g., for scientific machine learning tasks. Most proposed approaches for such cross-modal…

Machine Learning · Computer Science 2026-03-09 Paloma García-de-Herreros , Philipp Slusallek , Dietrich Klakow , Vagrant Gautam

Hybrid Decoding: Rapid Pass and Selective Detailed Correction for Sequence Models

Recently, Transformer-based encoder-decoder models have demonstrated strong performance in multilingual speech recognition. However, the decoder's autoregressive nature and large size introduce significant bottlenecks during inference.…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-28 Yunkyu Lim , Jihwan Park , Hyung Yong Kim , Hanbin Lee , Byeong-Yeol Kim

Word-level Embeddings for Cross-Task Transfer Learning in Speech Processing

Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer. In recent years, unsupervised and self-supervised techniques for learning speech representation were developed to foster automatic speech…

Computation and Language · Computer Science 2021-12-15 Pierre Beckmann , Mikolaj Kegler , Milos Cernak

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing…

Sound · Computer Science 2024-02-09 Sungho Jeon , Ching-Feng Yeh , Hakan Inan , Wei-Ning Hsu , Rashi Rungta , Yashar Mehdad , Daniel Bikel