Related papers: Streaming Sequence Transduction through Dynamic Co…

STAR: Speech-to-Audio Generation via Representation Learning

This work presents STAR, the first end-to-end speech-to-audio generation framework, designed to enhance efficiency and address error propagation inherent in cascaded systems. Unlike prior approaches relying on text or vision, STAR leverages…

Sound · Computer Science 2025-09-23 Zeyu Xie, Xuenan Xu, Yixuan Li, Mengyue Wu, Yuexian Zou

Streaming Punctuation for Long-form Dictation with Transformers

While speech recognition Word Error Rate (WER) has reached human parity for English, long-form dictation scenarios still suffer from segmentation and punctuation problems resulting from irregular pausing patterns or slow speakers.…

Computation and Language · Computer Science 2022-12-07 Piyush Behre, Sharman Tan, Padma Varadharajan, Shuangyu Chang

Streaming Punctuation: A Novel Punctuation Technique Leveraging Bidirectional Context for Continuous Speech Recognition

While speech recognition Word Error Rate (WER) has reached human parity for English, continuous speech recognition scenarios such as voice typing and meeting transcriptions still suffer from segmentation and punctuation problems, resulting…

Computation and Language · Computer Science 2023-01-11 Piyush Behre, Sharman Tan, Padma Varadharajan, Shuangyu Chang

Large-Scale Streaming End-to-End Speech Translation with Neural Transducers

Neural transducers have been widely used in automatic speech recognition (ASR). In this paper, we introduce it to streaming end-to-end speech translation (ST), which aims to convert audio signals to texts in other languages directly.…

Computation and Language · Computer Science 2022-07-05 Jian Xue, Peidong Wang, Jinyu Li, Matt Post, Yashesh Gaur

Self-Taught Recognizer: Toward Unsupervised Adaptation for Speech Foundation Models

We propose an unsupervised adaptation framework, Self-TAught Recognizer (STAR), which leverages unlabeled data to enhance the robustness of automatic speech recognition (ASR) systems in diverse target domains, such as noise and accents.…

Computation and Language · Computer Science 2024-05-24 Yuchen Hu, Chen Chen, Chao-Han Huck Yang, Chengwei Qin, Pin-Yu Chen, Eng Siong Chng, Chao Zhang

Streaming automatic speech recognition with the transformer model

Encoder-decoder based sequence-to-sequence models have demonstrated state-of-the-art results in end-to-end automatic speech recognition (ASR). Recently, the transformer architecture, which uses self-attention to model temporal context…

Sound · Computer Science 2020-07-02 Niko Moritz, Takaaki Hori, Jonathan Le Roux

Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement

We propose a first streaming accent conversion (AC) model that transforms non-native speech into a native-like accent while preserving speaker identity, prosody and improving pronunciation. Our approach enables stream processing by…

Computation and Language · Computer Science 2025-06-23 Tuan-Nam Nguyen, Ngoc-Quan Pham, Seymanur Akti, Alexander Waibel

Synchronous Speech Recognition and Speech-to-Text Translation with Interactive Decoding

Speech-to-text translation (ST), which translates source language speech into target language text, has attracted intensive attention in recent years. Compared to the traditional pipeline system, the end-to-end ST model has potential…

Computation and Language · Computer Science 2019-12-17 Yuchen Liu, Jiajun Zhang, Hao Xiong, Long Zhou, Zhongjun He, Hua Wu, Haifeng Wang, Chengqing Zong

Reducing the gap between streaming and non-streaming Transducer-based ASR by adaptive two-stage knowledge distillation

Transducer is one of the mainstream frameworks for streaming speech recognition. There is a performance gap between the streaming and non-streaming transducer models due to limited context. To reduce this gap, an effective way is to ensure…

Computation and Language · Computer Science 2023-06-28 Haitao Tang, Yu Fu, Lei Sun, Jiabin Xue, Dan Liu, Yongchao Li, Zhiqiang Ma, Minghui Wu, Jia Pan, Genshun Wan, Ming'en Zhao

Two-Pass End-to-End ASR Model Compression

Speech recognition on smart devices is challenging owing to the small memory footprint. Hence small size ASR models are desirable. With the use of popular transducer-based models, it has become possible to practically deploy streaming…

Audio and Speech Processing · Electrical Eng. & Systems 2022-01-11 Nauman Dawalatabad, Tushar Vatsal, Ashutosh Gupta, Sungsoo Kim, Shatrughan Singh, Dhananjaya Gowda, Chanwoo Kim

STAR: Scale-wise Text-conditioned AutoRegressive image generation

We introduce STAR, a text-to-image model that employs a scale-wise auto-regressive paradigm. Unlike VAR, which is constrained to class-conditioned synthesis for images up to 256$\times$256, STAR enables text-driven image generation up to…

Computer Vision and Pattern Recognition · Computer Science 2025-02-20 Xiaoxiao Ma, Mohan Zhou, Tao Liang, Yalong Bai, Tiejun Zhao, Biye Li, Huaian Chen, Yi Jin

Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition

Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR), and requires significantly less training time than RNN-based models. The original Transformer, with…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-14 Wenyong Huang, Wenchao Hu, Yu Ting Yeung, Xiao Chen

Transcribing and Translating, Fast and Slow: Joint Speech Translation and Recognition

We propose the joint speech translation and recognition (JSTAR) model that leverages the fast-slow cascaded encoder architecture for simultaneous end-to-end automatic speech recognition (ASR) and speech translation (ST). The model is…

Audio and Speech Processing · Electrical Eng. & Systems 2024-12-23 Niko Moritz, Ruiming Xie, Yashesh Gaur, Ke Li, Simone Merello, Zeeshan Ahmed, Frank Seide, Christian Fuegen

Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence…

Computation and Language · Computer Science 2020-11-03 Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Pino

Shifted Chunk Encoder for Transformer Based Streaming End-to-End ASR

Currently, there are mainly three kinds of Transformer encoder based streaming End to End (E2E) Automatic Speech Recognition (ASR) approaches, namely time-restricted methods, chunk-wise methods, and memory-based methods. Generally, all of…

Sound · Computer Science 2022-09-27 Fangyuan Wang, Bo Xu

Transformers from Compressed Representations

Compressed file formats are the cornerstone of efficient data storage and transmission, yet their potential for representation learning remains largely underexplored. We introduce TEMPEST (TransformErs froM comPressed rEpreSenTations), a…

Machine Learning · Computer Science 2025-10-30 Juan C. Leon Alcazar, Mattia Soldan, Mohammad Saatialsoruji, Alejandro Pardo, Hani Itani, Juan Camilo Perez, Bernard Ghanem

Segmentation-Free Streaming Machine Translation

Streaming Machine Translation (MT) is the task of translating an unbounded input text stream in real-time. The traditional cascade approach, which combines an Automatic Speech Recognition (ASR) and an MT system, relies on an intermediate…

Computation and Language · Computer Science 2024-05-29 Javier Iranzo-Sánchez, Jorge Iranzo-Sánchez, Adrià Giménez, Jorge Civera, Alfons Juan

Spanning Tree Autoregressive Visual Generation

We present Spanning Tree Autoregressive (STAR) modeling, which can incorporate prior knowledge of images, such as center bias and locality, to maintain sampling performance while also providing sufficiently flexible sequence orders to…

Computer Vision and Pattern Recognition · Computer Science 2025-11-24 Sangkyu Lee, Changho Lee, Janghoon Han, Hosung Song, Tackgeun You, Hwasup Lim, Stanley Jungkyu Choi, Honglak Lee, Youngjae Yu

Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR

Recently, there has been an increasing interest in unifying streaming and non-streaming speech recognition models to reduce development, training and deployment cost. The best-known approaches rely on either window-based or dynamic…

Audio and Speech Processing · Electrical Eng. & Systems 2023-04-27 Xilai Li, Goeric Huybrechts, Srikanth Ronanki, Jeff Farris, Sravan Bodapati

Adapting End-to-End Speech Recognition for Readable Subtitles

Automatic speech recognition (ASR) systems are primarily evaluated on transcription accuracy. However, in some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.…

Computation and Language · Computer Science 2020-05-26 Danni Liu, Jan Niehues, Gerasimos Spanakis