Long Lin — Scifaro

OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models

We present OmniVoice, a massively multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional…

Computation and Language · Computer Science · 2026-04-22 · Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey

ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

Generating spoken dialogue is inherently more complex than monologue text-to-speech (TTS), as it demands both realistic turn-taking and the maintenance of distinct speaker timbres. While existing autoregressive (AR) models have made…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-04-15 · Han Zhu, Wei Kang, Liyong Guo, Zengwei Yao, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Dong Zhang, Xin Zhang, Xingchen Song, Lingxuan Ye, Long Lin, Daniel Povey

Flow2GAN: Hybrid Flow Matching and GAN with Multi-Resolution Network for Few-step High-Fidelity Audio Generation

Existing dominant methods for audio generation include Generative Adversarial Networks (GANs) and diffusion-based methods like Flow Matching. GANs suffer from slow convergence during training, while diffusion methods require multi-step…

Audio and Speech Processing · Electrical Eng. & Systems · 2026-03-10 · Zengwei Yao, Wei Kang, Han Zhu, Liyong Guo, Lingxuan Ye, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Long Lin, Daniel Povey

CLARAE: Clarity Preserving Reconstruction AutoEncoder for Denoising and Rhythm Classification of Intracardiac Electrograms

Intracavitary atrial electrograms (EGMs) provide high-resolution insights into cardiac electrophysiology but are often contaminated by noise and remain high-dimensional, limiting real-time analysis. We introduce CLARAE (CLArity-preserving…

Signal Processing · Electrical Eng. & Systems · 2025-10-22 · Long Lin, Pablo Peiro-Corbacho, Pablo Ávila, Alejandro Carta-Bergaz, Ángel Arenal, Gonzalo R. Ríos-Muñoz, Carlos Sevilla-Salcedo

ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching

Existing large-scale zero-shot text-to-speech (TTS) models deliver high speech quality but suffer from slow inference speeds due to massive parameters. To address this issue, this paper introduces ZipVoice, a high-quality…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-08-08 · Han Zhu, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhaoqing Li, Weiji Zhuang, Long Lin, Daniel Povey

Latent Representations of Intracardiac Electrograms for Atrial Fibrillation Driver Detection

Atrial Fibrillation (AF) is the most prevalent sustained arrhythmia, yet current ablation therapies, including pulmonary vein isolation, are frequently ineffective in persistent AF due to the involvement of non-pulmonary vein drivers. This…

Machine Learning · Computer Science · 2025-07-29 · Pablo Peiro-Corbacho, Long Lin, Pablo Ávila, Alejandro Carta-Bergaz, Ángel Arenal, Carlos Sevilla-Salcedo, Gonzalo R. Ríos-Muñoz

k2SSL: A Faster and Better Framework for Self-Supervised Speech Representation Learning

Self-supervised learning (SSL) has achieved great success in speech-related tasks. While Transformer and Conformer architectures have dominated SSL backbones, encoders like Zipformer, which excel in automatic speech recognition (ASR),…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-03-25 · Yifan Yang, Jianheng Zhuo, Zengrui Jin, Ziyang Ma, Xiaoyu Yang, Zengwei Yao, Liyong Guo, Wei Kang, Fangjun Kuang, Long Lin, Daniel Povey, Xie Chen

CR-CTC: Consistency regularization on CTC for improved speech recognition

Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we…

Audio and Speech Processing · Electrical Eng. & Systems · 2025-02-17 · Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges…

Sound · Computer Science · 2024-09-04 · Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey

Zipformer: A faster and better encoder for automatic speech recognition

The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-04-11 · Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey

PromptASR for contextualized ASR with controllable style

Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-01-25 · Xiaoyu Yang, Wei Kang, Zengwei Yao, Yifan Yang, Liyong Guo, Fangjun Kuang, Long Lin, Daniel Povey

Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely-available corpus of speech with…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-01-17 · Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey

Blank-regularized CTC for Frame Skipping in Neural Transducer

Neural Transducer and connectionist temporal classification (CTC) are popular end-to-end automatic speech recognition systems. Due to their frame-synchronous design, blank symbols are introduced to address the length mismatch between…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-05-22 · Yifan Yang, Xiaoyu Yang, Liyong Guo, Zengwei Yao, Wei Kang, Fangjun Kuang, Long Lin, Xie Chen, Daniel Povey

Delay-penalized CTC implemented based on Finite State Transducer

Connectionist Temporal Classification (CTC) suffers from the latency problem when applied to streaming models. We argue that in CTC lattice, the alignments that can access more future context are preferred during training, thereby leading…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-05-22 · Zengwei Yao, Wei Kang, Fangjun Kuang, Liyong Guo, Xiaoyu Yang, Yifan Yang, Long Lin, Daniel Povey

Predicting Multi-Codebook Vector Quantization Indexes for Knowledge Distillation

Knowledge distillation (KD) is a common approach to improve model performance in automatic speech recognition (ASR), where a student model is trained to imitate the output behaviour of a teacher model. However, traditional KD methods suffer…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-11-02 · Liyong Guo, Xiaoyu Yang, Quandong Wang, Yuxiang Kong, Zengwei Yao, Fan Cui, Fangjun Kuang, Wei Kang, Long Lin, Mingshuang Luo, Piotr Zelasko, Daniel Povey

Fast and parallel decoding for transducer

The transducer architecture is becoming increasingly popular in the field of speech recognition, because it is naturally streaming as well as high in accuracy. One of the drawbacks of transducer is that it is difficult to decode in a fast…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-11-02 · Wei Kang, Liyong Guo, Fangjun Kuang, Long Lin, Mingshuang Luo, Zengwei Yao, Xiaoyu Yang, Piotr Żelasko, Daniel Povey

Pruned RNN-T for fast, memory-efficient ASR training

The RNN-Transducer (RNN-T) framework for speech recognition has been growing in popularity, particularly for deployed real-time ASR systems, because it combines high accuracy with naturally streaming recognition. One of the drawbacks of…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-06-28 · Fangjun Kuang, Liyong Guo, Wei Kang, Long Lin, Mingshuang Luo, Zengwei Yao, Daniel Povey

Winning Isn't Everything: Enhancing Game Development with Intelligent Agents

Recently, there have been several high-profile achievements of agents learning to play games against humans and beat them. In this paper, we study the problem of training intelligent agents in service of game development. Unlike the agents…

Artificial Intelligence · Computer Science · 2020-04-29 · Yunqi Zhao, Igor Borovikov, Fernando de Mesentier Silva, Ahmad Beirami, Jason Rupert, Caedmon Somers, Jesse Harder, John Kolen, Jervis Pinto, Reza Pourabolghasem, James Pestrak, Harold Chaput, Mohsen Sardari, Long Lin, Sundeep Narravula, Navid Aghdaie, Kazi Zaman