Li-Rong Dai — Scifaro

Generative Diffusion Contrastive Network for Multi-View Clustering

In recent years, Multi-View Clustering (MVC) has been significantly advanced under the influence of deep learning. By integrating heterogeneous data from multiple views, MVC enhances clustering analysis, making multi-view fusion critical to…

Computer Vision and Pattern Recognition · Computer Science 2026-01-21 Jian Zhu, Xin Zou, Xi Wang, Lei Liu, Chang Tang, Li-Rong Dai

Enhancing Noise Robustness for Neural Speech Codecs through Resource-Efficient Progressive Quantization Perturbation Simulation

Noise robustness remains a critical challenge for deploying neural speech codecs in real-world acoustic scenarios where background noise is often inevitable. A key observation we make is that even slight input noise perturbations can cause…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-14 Rui-Chen Zheng, Yang Ai, Hui-Peng Du, Li-Rong Dai

Trusted Mamba Contrastive Network for Multi-View Clustering

Multi-view clustering can partition data samples into their categories by learning a consensus representation in an unsupervised way and has received more and more attention in recent years. However, there is an untrusted fusion problem.…

Computer Vision and Pattern Recognition · Computer Science 2025-01-08 Jian Zhu, Xin Zou, Lei Liu, Zhangmin Huang, Ying Zhang, Chang Tang, Li-Rong Dai

Adaptive Confidence Multi-View Hashing for Multimedia Retrieval

The multi-view hash method converts heterogeneous data from multiple views into binary hash codes, which is one of the critical technologies in multimedia retrieval. However, the current methods mainly explore the complementarity among…

Computer Vision and Pattern Recognition · Computer Science 2024-01-17 Jian Zhu, Yu Cui, Zhangmin Huang, Xingyu Li, Lei Liu, Lingfang Zeng, Li-Rong Dai

CASA-ASR: Context-Aware Speaker-Attributed ASR

Recently, speaker-attributed automatic speech recognition (SA-ASR), which aims to answer the question "who spoke what", has attracted wide attention. Different from modular systems, end-to-end (E2E) SA-ASR minimizes the…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-23 Mohan Shi, Zhihao Du, Qian Chen, Fan Yu, Yangze Li, Shiliang Zhang, Jie Zhang, Li-Rong Dai

Semantic VAD: Low-Latency Voice Activity Detection for Speech Interaction

For speech interaction, voice activity detection (VAD) is often used as a front-end. However, traditional VAD algorithms usually need to wait for a continuous tail silence to reach a preset maximum duration before segmentation, resulting in…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-23 Mohan Shi, Yuchun Shu, Lingyun Zuo, Qian Chen, Shiliang Zhang, Jie Zhang, Li-Rong Dai

Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection

In this paper, we propose a joint generative and contrastive representation learning method (GeCo) for anomalous sound detection (ASD). GeCo exploits a Predictive AutoEncoder (PAE) equipped with self-attention as a generative model to…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-23 Xiao-Min Zeng, Yan Song, Zhu Zhuo, Yu Zhou, Yu-Hong Li, Hui Xue, Li-Rong Dai, Ian McLoughlin

AST-SED: An Effective Sound Event Detection Method Based on Audio Spectrogram Transformer

In this paper, we propose an effective sound event detection (SED) method based on the audio spectrogram transformer (AST) model, pretrained on the large-scale AudioSet for the audio tagging (AT) task, termed AST-SED. Pretrained AST models have…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-08 Kang Li, Yan Song, Li-Rong Dai, Ian McLoughlin, Xin Fang, Lin Liu

A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-party Meetings

Speaker-attributed automatic speech recognition (SA-ASR) in multi-party meeting scenarios is one of the most valuable and challenging ASR tasks. It was shown that single-channel frame-level diarization with serialized output training…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-03 Mohan Shi, Jie Zhang, Zhihao Du, Fan Yu, Qian Chen, Shiliang Zhang, Li-Rong Dai

Robust Data2vec: Noise-robust Speech Representation Learning for ASR by Combining Regression and Improved Contrastive Learning

Self-supervised pre-training methods based on contrastive learning or regression tasks can utilize more unlabeled data to improve the performance of automatic speech recognition (ASR). However, the robustness impact of combining the two…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-28 Qiu-Shi Zhu, Long Zhou, Jie Zhang, Shu-Jie Liu, Yu-Chen Hu, Li-Rong Dai

Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition

With the advance in self-supervised learning for audio and visual modalities, it has become possible to learn a robust audio-visual speech representation. This would be beneficial for improving the audio-visual speech recognition (AVSR)…

Image and Video Processing · Electrical Eng. & Systems 2022-07-12 Zi-Qiang Zhang, Jie Zhang, Jian-Shu Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai

Joint Training of Speech Enhancement and Self-supervised Model for Noise-robust ASR

Speech enhancement (SE) is usually required as a front end to improve the speech quality in noisy environments, while the enhanced speech might not be optimal for automatic speech recognition (ASR) systems due to speech distortion. On the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-05-27 Qiu-Shi Zhu, Jie Zhang, Zi-Qiang Zhang, Li-Rong Dai

Supervised and Self-supervised Pretraining Based COVID-19 Detection Using Acoustic Breathing/Cough/Speech Signals

In this work, we propose a bi-directional long short-term memory (BiLSTM) network based COVID-19 detection method using breath/speech/cough signals. By using the acoustic signals to train the network, respectively, we can build individual…

Audio and Speech Processing · Electrical Eng. & Systems 2022-05-10 Xing-Yu Chen, Qiu-Shi Zhu, Jie Zhang, Li-Rong Dai

A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition

Wav2vec2.0 is a popular self-supervised pre-training framework for learning speech representations in the context of automatic speech recognition (ASR). It was shown that wav2vec2.0 is robust to domain shift, while the…

Audio and Speech Processing · Electrical Eng. & Systems 2022-05-10 Qiu-Shi Zhu, Jie Zhang, Zi-Qiang Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai

A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition

Unpaired data has been shown to be beneficial for low-resource automatic speech recognition (ASR), which can be involved in the design of hybrid models with multi-task training or language model dependent pre-training. In this work, we leverage…

Sound · Computer Science 2022-04-06 Ye-Qian Du, Jie Zhang, Qiu-Shi Zhu, Li-Rong Dai, Ming-Hui Wu, Xin Fang, Zhou-Wang Yang

XLST: Cross-lingual Self-training to Learn Multilingual Representation for Low Resource Speech Recognition

In this paper, we propose a weakly supervised multilingual representation learning framework, called cross-lingual self-training (XLST). XLST is able to utilize a small amount of annotated data from high-resource languages to improve the…

Audio and Speech Processing · Electrical Eng. & Systems 2021-03-16 Zi-Qiang Zhang, Yan Song, Ming-Hui Wu, Xin Fang, Li-Rong Dai

Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention

In this paper, we propose a novel deep learning architecture to improve word-level lip-reading. On the one hand, we first introduce multi-scale processing into the spatial feature extraction for lip-reading. Specifically, we proposed…

Computer Vision and Pattern Recognition · Computer Science 2020-12-29 Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Chin-Hui Lee, Bao-Cai Yin

Correlating Subword Articulation with Lip Shapes for Embedding Aware Audio-Visual Speech Enhancement

In this paper, we propose a visual embedding approach to improving embedding aware speech enhancement (EASE) by synchronizing visual lip frames at the phone and place of articulation levels. We first extract visual embedding from lip frames…

Sound · Computer Science 2020-09-22 Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee

Voice Conversion by Cascading Automatic Speech Recognition and Text-to-Speech Synthesis with Prosody Transfer

With the development of automatic speech recognition (ASR) and text-to-speech synthesis (TTS) techniques, it is intuitive to construct a voice conversion system by cascading an ASR and a TTS system. In this paper, we present an ASR-TTS method…

Audio and Speech Processing · Electrical Eng. & Systems 2020-09-04 Jing-Xuan Zhang, Li-Juan Liu, Yan-Nian Chen, Ya-Jun Hu, Yuan Jiang, Zhen-Hua Ling, Li-Rong Dai

Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning

This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion. A recognizer is used to transform acoustic features into linguistic representations while a synthesizer recovers output…

Audio and Speech Processing · Electrical Eng. & Systems 2020-08-07 Jing-Xuan Zhang, Zhen-Hua Ling, Li-Rong Dai