Related papers: Textually Pretrained Speech Language Models

Scaling Speech-Text Pre-training with Synthetic Interleaved Data

Speech language models (SpeechLMs) accept speech input and produce speech output, allowing for more natural human-computer interaction compared to text-based large language models (LLMs). Traditional approaches for developing SpeechLMs are…

Computation and Language · Computer Science 2024-12-03 Aohan Zeng, Zhengxiao Du, Mingdao Liu, Lei Zhang, Shengmin Jiang, Yuxiao Dong, Jie Tang

Recent Advances in Speech Language Models: A Survey

Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based…

Computation and Language · Computer Science 2025-08-08 Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Yiwen Guo, Irwin King

Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings

Speech data has rich acoustic and paralinguistic information with important cues for understanding a speaker's tone, emotion, and intent, yet traditional large language models such as BERT do not incorporate this information. There has been…

Computation and Language · Computer Science 2023-11-14 Fatema Hasan, Yulong Li, James Foulds, Shimei Pan, Bishwaranjan Bhattacharjee

LAST: Language Model Aware Speech Tokenization

Speech tokenization serves as the foundation of speech language models (LMs), enabling them to perform various tasks such as spoken language modeling, text-to-speech, speech-to-text, etc. Most speech tokenizers are trained independently of…

Computation and Language · Computer Science 2024-09-11 Arnon Turetzky, Yossi Adi

SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

How to boost speech pre-training with textual data is an unsolved problem due to the fact that speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model…

Computation and Language · Computer Science 2023-06-16 Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei

SLM: Bridge the thin gap between speech and text foundation models

We present a joint Speech and Language Model (SLM), a multitask, multilingual, and dual-modal model that takes advantage of pretrained foundational speech and language models. SLM freezes the pretrained foundation models to maximally…

Computation and Language · Computer Science 2023-10-03 Mingqiu Wang, Wei Han, Izhak Shafran, Zelin Wu, Chung-Cheng Chiu, Yuan Cao, Yongqiang Wang, Nanxin Chen, Yu Zhang, Hagen Soltau, Paul Rubenstein, Lukas Zilka, Dian Yu, Zhong Meng, Golan Pundak, Nikhil Siddhartha, Johan Schalkwyk, Yonghui Wu

Textless Speech-to-Speech Translation on Real Data

We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language and can be built without the need of any text data. Different from existing work in the literature, we tackle…

Computation and Language · Computer Science 2022-05-06 Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, Wei-Ning Hsu

Interleaved Speech-Text Language Models for Simple Streaming Text-to-Speech Synthesis

This paper introduces Interleaved Speech-Text Language Model (IST-LM) for zero-shot streaming Text-to-Speech (TTS). Unlike many previous approaches, IST-LM is directly trained on interleaved sequences of text and speech tokens with a fixed…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-12 Yifan Yang, Shujie Liu, Jinyu Li, Hui Wang, Lingwei Meng, Haiyang Sun, Yuzhe Liang, Ziyang Ma, Yuxuan Hu, Rui Zhao, Jianwei Yu, Yan Lu, Xie Chen

Scheduled Interleaved Speech-Text Training for Speech-to-Speech Translation with LLMs

Speech-to-speech translation (S2ST) has been advanced with large language models (LLMs), which are fine-tuned on discrete speech units. In such approaches, modality adaptation from text to speech has been an issue. LLMs are trained on…

Computation and Language · Computer Science 2025-06-13 Hayato Futami, Emiru Tsunoo, Yosuke Kashiwagi, Yuki Ito, Hassan Shahmohammadi, Siddhant Arora, Shinji Watanabe

WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation

Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing…

Machine Learning · Computer Science 2026-03-06 Luca Della Libera, Cem Subakan, Mirco Ravanelli

Speech Translation with Large Language Models: An Industrial Practice

Given the great success of large language models (LLMs) across various tasks, in this paper, we introduce LLM-ST, a novel and effective speech translation model constructed upon a pre-trained LLM. By integrating the large language model…

Computation and Language · Computer Science 2023-12-22 Zhichao Huang, Rong Ye, Tom Ko, Qianqian Dong, Shanbo Cheng, Mingxuan Wang, Hang Li

Cross-Lingual Interleaving for Speech Language Models

Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress…

Computation and Language · Computer Science 2026-02-23 Adel Moumen, Guangzhi Sun, Philip C. Woodland

Leveraging Pre-trained Language Model for Speech Sentiment Analysis

In this paper, we explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis. First, we investigate how useful a pre-trained language model would be in a 2-step pipeline…

Computation and Language · Computer Science 2021-06-15 Suwon Shon, Pablo Brusco, Jing Pan, Kyu J. Han, Shinji Watanabe

MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation

There have been emerging research interest and advances in speech-to-speech translation (S2ST), translating utterances from one language to another. This work proposes Multitask Speech Language Model (MSLM), which is a decoder-only speech…

Computation and Language · Computer Science 2024-03-20 Yifan Peng, Ilia Kulikov, Yilin Yang, Sravya Popuri, Hui Lu, Changhan Wang, Hongyu Gong

Scaling Analysis of Interleaved Speech-Text Language Models

Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern…

Computation and Language · Computer Science 2025-07-29 Gallil Maimon, Michael Hassid, Amit Roth, Yossi Adi

SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought

Expressive speech-to-speech translation (S2ST) is a key research topic in seamless communication, which focuses on the preservation of semantics and speaker vocal style in translated speech. Early works synthesized speaker style aligned…

Computation and Language · Computer Science 2024-06-03 Hongyu Gong, Bandhav Veluri

SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation

Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more…

Computation and Language · Computer Science 2026-04-21 Sirry Chen, Jieyi Wang, Wei Chen, Zhongyu Wei

TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling

Recent efforts target spoken language models (SLMs) that not only listen but also speak for more natural human-LLM interaction. Joint speech-text modeling is a promising direction to achieve this. However, the effectiveness of recent speech…

Computation and Language · Computer Science 2026-02-06 Liang-Hsuan Tseng, Yi-Chang Chen, Kuan-Yi Lee, Da-Shan Shiu, Hung-yi Lee

DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data

Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs) by incorporating pre-trained speech models. However, these SLMs often undergo extensive speech instruction-tuning to bridge…

Audio and Speech Processing · Electrical Eng. & Systems 2025-07-30 Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Jagadeesh Balam, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee

Speech Model Pre-training for End-to-End Spoken Language Understanding

Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these…

Audio and Speech Processing · Electrical Eng. & Systems 2019-07-26 Loren Lugosch, Mirco Ravanelli, Patrick Ignoto, Vikrant Singh Tomar, Yoshua Bengio