Related papers: Voxtral Realtime

Turning Whisper into Real-Time Transcription System

Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an…

Computation and Language · Computer Science 2023-09-22 Dominik Macháček , Raj Dabre , Ondřej Bojar

VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency

We present VoXtream, a fully autoregressive, zero-shot streaming text-to-speech (TTS) system for real-time use that begins speaking from the first word. VoXtream directly maps incoming phonemes to audio tokens using a monotonic alignment…

Audio and Speech Processing · Electrical Eng. & Systems 2026-01-27 Nikita Torgashov , Gustav Eje Henter , Gabriel Skantze

Voxtral TTS

We introduce Voxtral TTS, an expressive multilingual text-to-speech model that generates natural speech from as little as 3 seconds of reference audio. Voxtral TTS adopts a hybrid architecture that combines auto-regressive generation of…

Artificial Intelligence · Computer Science 2026-04-07 Mistral-AI , Alexander H. Liu , Alexis Tacnet , Andy Ehrenberg , Andy Lo , Chen-Yo Sun , Guillaume Lample , Henry Lagarde , Jean-Malo Delignon , Jaeyoung Kim , John Harvill , Khyathi Raghavi Chandu , Lorenzo Signoretti , Margaret Jennings , Patrick von Platen , Pavankumar Reddy Muddireddy , Rohin Arora , Sanchit Gandhi , Samuel Humeau , Soham Ghosh , Srijan Mishra , Van Phung , Abdelaziz Bounhar , Abhinav Rastogi , Adrien Sadé , Alan Jeffares , Albert Jiang , Alexandre Cahill , Alexandre Gavaudan , Alexandre Sablayrolles , Amélie Héliou , Amos You , Andrew Bai , Andrew Zhao , Angele Lenglemetz , Anmol Agarwal , Anton Eliseev , Antonia Calvi , Arjun Majumdar , Arthur Fournier , Artjom Joosen , Avi Sooriyarachchi , Aysenur Karaduman Utkur , Baptiste Bout , Baptiste Rozière , Baudouin De Monicault , Benjamin Tibi , Bowen Yang , Charlotte Cronjäger , Clémence Lanfranchi , Connor Chen , Corentin Barreau , Corentin Sautier , Cyprien Courtot , Darius Dabert , Diego de las Casas , Elizaveta Demyanenko , Elliot Chane-Sane , Emmanuel Gottlob , Enguerrand Paquin , Etienne Goffinet , Fabien Niel , Faruk Ahmed , Federico Baldassarre , Gabrielle Berrada , Gaëtan Ecrepont , Gauthier Guinet , Genevieve Hayes , Georgii Novikov , Giada Pistilli , Guillaume Kunsch , Guillaume Martin , Guillaume Raille , Gunjan Dhanuka , Gunshi Gupta , Han Zhou , Harshil Shah , Hope McGovern , Hugo Thimonier , Indraneel Mukherjee , Irene Zhang , Jacques Sun , Jan Ludziejewski , Jason Rute , Jérémie Dentan , Joachim Studnia , Jonas Amar , Joséphine Delas , Josselin Somerville Roberts , Julien Tauran , Karmesh Yadav , Kartik Khandelwal , Kilian Tep , Kush Jain , Laurence Aitchison , Laurent Fainsin , Léonard Blier , Lingxiao Zhao , Louis Martin , Lucile Saulnier , Luyu Gao , Maarten Buyl , Manan Sharma , Marie Pellat , Mark Prins , Martin Alexandre , Mathieu Poirée , Mathieu Schmitt , Mathilde Guillaumin , Matthieu Dinot , Matthieu Futeral , Maxime Darrin , Maximilian Augustin , Mert Unsal , Mia Chiquier , Mikhail Biriuchinskii , Minh-Quang Pham , Mircea Lica , Morgane Rivière , Nathan Grinsztajn , Neha Gupta , Olivier Bousquet , Olivier Duchenne , Patricia Wang , Paul Jacob , Paul Wambergue , Paula Kurylowicz , Philippe Pinel , Philomène Chagniot , Pierre Stock , Piotr Miłoś , Prateek Gupta , Pravesh Agrawal , Quentin Torroba , Ram Ramrakhya , Randall Isenhour , Rishi Shah , Romain Sauvestre , Roman Soletskyi , Rosalie Millner , Rupert Menneer , Sagar Vaze , Samuel Barry , Samuel Belkadi , Sandeep Subramanian , Sean Cha , Shashwat Verma , Siddhant Waghjale , Siddharth Gandhi , Simon Lepage , Sumukh Aithal , Szymon Antoniak , Tarun Kumar Vangani , Teven Le Scao , Théo Cachet , Theo Simon Sorg , Thibaut Lavril , Thomas Chabal , Thomas Foubert , Thomas Robert , Thomas Wang , Tim Lawson , Tom Bewley , Tom Edwards , Tyler Wang , Umar Jamil , Umberto Tomasini , Valeriia Nemychnikova , Vedant Nanda , Victor Jouault , Vincent Maladière , Vincent Pfister , Virgile Richard , Vladislav Bataev , Wassim Bouaziz , Wen-Ding Li , William Havard , William Marshall , Xinghui Li , Xingran Guo , Xinyu Yang , Yannic Neuhaus , Yassine El Ouahidi , Yassir Bendou , Yihan Wang , Yimu Pan , Zaccharie Ramzi , Zhenlin Xu

WhisperRT -- Turning Whisper into a Causal Streaming Model

Automatic Speech Recognition (ASR) has seen remarkable progress, with models like OpenAI Whisper and NVIDIA Canary achieving state-of-the-art (SOTA) performance in offline transcription. However, these models are not designed for streaming…

Computation and Language · Computer Science 2026-04-07 Tomer Krichli , Bhiksha Raj , Joseph Keshet

Voxtral

We present Voxtral Mini and Voxtral Small, two multimodal audio chat models. Voxtral is trained to comprehend both spoken audio and text documents, achieving state-of-the-art performance across a diverse range of audio benchmarks, while…

Sound · Computer Science 2025-07-18 Alexander H. Liu , Andy Ehrenberg , Andy Lo , Clément Denoix , Corentin Barreau , Guillaume Lample , Jean-Malo Delignon , Khyathi Raghavi Chandu , Patrick von Platen , Pavankumar Reddy Muddireddy , Sanchit Gandhi , Soham Ghosh , Srijan Mishra , Thomas Foubert , Abhinav Rastogi , Adam Yang , Albert Q. Jiang , Alexandre Sablayrolles , Amélie Héliou , Amélie Martin , Anmol Agarwal , Antoine Roux , Arthur Darcet , Arthur Mensch , Baptiste Bout , Baptiste Rozière , Baudouin De Monicault , Chris Bamford , Christian Wallenwein , Christophe Renaudin , Clémence Lanfranchi , Darius Dabert , Devendra Singh Chaplot , Devon Mizelle , Diego de las Casas , Elliot Chane-Sane , Emilien Fugier , Emma Bou Hanna , Gabrielle Berrada , Gauthier Delerce , Gauthier Guinet , Georgii Novikov , Guillaume Martin , Himanshu Jaju , Jan Ludziejewski , Jason Rute , Jean-Hadrien Chabran , Jessica Chudnovsky , Joachim Studnia , Joep Barmentlo , Jonas Amar , Josselin Somerville Roberts , Julien Denize , Karan Saxena , Karmesh Yadav , Kartik Khandelwal , Kush Jain , Lélio Renard Lavaud , Léonard Blier , Lingxiao Zhao , Louis Martin , Lucile Saulnier , Luyu Gao , Marie Pellat , Mathilde Guillaumin , Mathis Felardos , Matthieu Dinot , Maxime Darrin , Maximilian Augustin , Mickaël Seznec , Neha Gupta , Nikhil Raghuraman , Olivier Duchenne , Patricia Wang , Patryk Saffer , Paul Jacob , Paul Wambergue , Paula Kurylowicz , Philomène Chagniot , Pierre Stock , Pravesh Agrawal , Rémi Delacourt , Romain Sauvestre , Roman Soletskyi , Sagar Vaze , Sandeep Subramanian , Saurabh Garg , Shashwat Dalal , Siddharth Gandhi , Sumukh Aithal , Szymon Antoniak , Teven Le Scao , Thibault Schueller , Thibaut Lavril , Thomas Robert , Thomas Wang , Timothée Lacroix , Tom Bewley , Valeriia Nemychnikova , Victor Paltz , Virgile Richard , Wen-Ding Li , William Marshall , Xuanyu Zhang , Yihan Wan , Yunhao Tang

VoxServe: Streaming-Centric Serving System for Speech Language Models

Deploying modern Speech Language Models (SpeechLMs) in streaming settings requires systems that provide low latency, high throughput, and strong guarantees of streamability. Existing systems fall short of supporting diverse models flexibly…

Machine Learning · Computer Science 2026-02-03 Keisuke Kamahori , Wei-Tzu Lee , Atindra Jha , Rohan Kadekodi , Stephanie Wang , Arvind Krishnamurthy , Baris Kasikci

Scalable Offline ASR for Command-Style Dictation in Courtrooms

We propose an open-source framework for Command-style dictation that addresses the gap between resource-intensive Online systems and high-latency Batch processing. Our approach uses Voice Activity Detection (VAD) to segment audio and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-09-16 Kumarmanas Nethil , Vaibhav Mishra , Kriti Anandan , Kavya Manohar

Pushing the Limits of On-Device Streaming ASR: A Compact, High-Accuracy English Model for Low-Latency Inference

Deploying high-quality automatic speech recognition (ASR) on edge devices requires models that jointly optimize accuracy, latency, and memory footprint while operating entirely on CPU without GPU acceleration. We conduct a systematic…

Artificial Intelligence · Computer Science 2026-04-21 Nenad Banfic , David Fan , Kunal Vaishnavi , Sam Kemp , Sunghoon Choi , Rui Ren , Sayan Shaw , Meng Tang

Sink or SWIM: Tackling Real-Time ASR at Scale

Real-time automatic speech recognition systems are increasingly integrated into interactive applications, from voice assistants to live transcription services. However, scaling these systems to support multiple concurrent clients while…

Sound · Computer Science 2026-04-14 Federico Bruzzone , Walter Cazzola , Matteo Brancaleoni , Dario Pellegrino

Building Accurate Low Latency ASR for Streaming Voice Search

Automatic Speech Recognition (ASR) plays a crucial role in voice-based applications. For applications requiring real-time feedback like Voice Search, streaming capability becomes vital. While LSTM/RNN and CTC based ASR systems are commonly…

Sound · Computer Science 2023-05-31 Abhinav Goyal , Nikesh Garera

WhisperX: Time-Accurate Speech Transcription of Long-Form Audio

Large-scale, weakly-supervised speech recognition models, such as Whisper, have demonstrated impressive results on speech recognition across domains and languages. However, their application to long audio transcription via buffered or…

Sound · Computer Science 2023-07-12 Max Bain , Jaesung Huh , Tengda Han , Andrew Zisserman

VoxRAG: A Step Toward Transcription-Free RAG Systems in Spoken Question Answering

We introduce VoxRAG, a modular speech-to-speech retrieval-augmented generation system that bypasses transcription to retrieve semantically relevant audio segments directly from spoken queries. VoxRAG employs silence-aware segmentation,…

Information Retrieval · Computer Science 2025-08-08 Zackary Rackauckas , Julia Hirschberg

DualVC 3: Leveraging Language Model Generated Pseudo Context for End-to-end Low Latency Streaming Voice Conversion

Streaming voice conversion has become increasingly popular for its potential in real-time applications. The recently proposed DualVC 2 has achieved robust and high-quality streaming voice conversion with a latency of about 180ms.…

Audio and Speech Processing · Electrical Eng. & Systems 2024-06-13 Ziqian Ning , Shuai Wang , Pengcheng Zhu , Zhichao Wang , Jixun Yao , Lei Xie , Mengxiao Bi

VoXtream2: Full-stream TTS with dynamic speaking rate control

Full-stream text-to-speech (TTS) for interactive systems must start speaking with minimal delay while remaining controllable as text arrives incrementally. We present VoXtream2, a zero-shot full-stream TTS model with dynamic speaking-rate…

Audio and Speech Processing · Electrical Eng. & Systems 2026-03-17 Nikita Torgashov , Gustav Eje Henter , Gabriel Skantze

Universal ASR: Unifying Streaming and Non-Streaming ASR Using a Single Encoder-Decoder Model

Recently, online end-to-end ASR has gained increasing attention. However, the performance of online systems still lags far behind that of offline systems, with a large gap in quality of recognition. For specific scenarios, we can trade-off…

Sound · Computer Science 2020-10-28 Zhifu Gao , Shiliang Zhang , Ming Lei , Ian McLoughlin

WhisperPipe: A Resource-Efficient Streaming Architecture for Real-Time Automatic Speech Recognition

Real-time automatic speech recognition (ASR) systems face a fundamental trade-off between transcription accuracy and computational efficiency, particularly when deploying large-scale transformer models like Whisper. Existing streaming…

Computation and Language · Computer Science 2026-04-29 Erfan Ramezani , Mohammad Mahdi Giahi , Mohammad Erfan Zarabadipour , Amir Reza Yosefian , Hamid Ghadiri

Overcoming Latency Bottlenecks in On-Device Speech Translation: A Cascaded Approach with Alignment-Based Streaming MT

This paper tackles several challenges that arise when integrating Automatic Speech Recognition (ASR) and Machine Translation (MT) for real-time, on-device streaming speech translation. Although state-of-the-art ASR systems based on…

Computation and Language · Computer Science 2025-08-20 Zeeshan Ahmed , Frank Seide , Niko Moritz , Ju Lin , Ruiming Xie , Simone Merello , Zhe Liu , Christian Fuegen

WhisperKit: On-device Real-time ASR with Billion-Scale Transformers

Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcriptions, and medical scribes. Accuracy and latency are the most…

Sound · Computer Science 2025-07-16 Atila Orhon , Arda Okan , Berkin Durmus , Zach Nagengast , Eduardo Pacheco

Speech ReaLLM -- Real-time Streaming Speech Recognition with Multimodal LLMs by Teaching the Flow of Time

We introduce Speech ReaLLM, a new ASR architecture that marries "decoder-only" ASR with the RNN-T to make multimodal LLM architectures capable of real-time streaming. This is the first "decoder-only" ASR architecture designed to handle…

Computation and Language · Computer Science 2024-06-17 Frank Seide , Morrie Doulaty , Yangyang Shi , Yashesh Gaur , Junteng Jia , Chunyang Wu

VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models

Streaming vision-language models (VLMs) continuously generate responses given an instruction prompt and an online stream of input frames. This is a core mechanism for real-time visual assistants. Existing VLM frameworks predominantly assess…

Computer Vision and Pattern Recognition · Computer Science 2026-05-07 Pavan Kumar Anasosalu Vasu , Cem Koc , Fartash Faghri , Chun-Liang Li , Bo Feng , Zhengfeng Lai , Meng Cao , Oncel Tuzel , Hadi Pouransari