Related papers: Robust Self-Supervised Audio-Visual Speech Recogni…

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been…

Computer Vision and Pattern Recognition · Computer Science 2023-06-29 Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT…

Audio and Speech Processing · Electrical Eng. & Systems 2022-03-15 Bowen Shi, Wei-Ning Hsu, Kushal Lakhotia, Abdelrahman Mohamed

VisG AV-HuBERT: Viseme-Guided AV-HuBERT

Audio-Visual Speech Recognition (AVSR) systems nowadays integrate Large Language Model (LLM) decoders with transformer-based encoders, achieving state-of-the-art results. However, the relative contributions of improved language modelling…

Audio and Speech Processing · Electrical Eng. & Systems 2026-04-02 Aristeidis Papadopoulos, Rishabh Jain, Naomi Harte

Practice of the conformer enhanced AUDIO-VISUAL HUBERT on Mandarin and English

Considering the bimodal nature of human speech perception, lips, and teeth movement has a pivotal role in automatic speech recognition. Benefiting from the correlated and noise-invariant visual information, audio-visual recognition systems…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-23 Xiaoming Ren, Chao Li, Shenjian Wang, Biao Li

Self-Supervised Adaptive AV Fusion Module for Pre-Trained ASR Models

Automatic speech recognition (ASR) has in recent years reached a level of accuracy that even outperforms humans in transcribing speech to text. Nevertheless, all current ASR approaches show a certain weakness against ambient noise. To…

Sound · Computer Science 2023-12-22 Christopher Simic, Tobias Bocklet

Learning Lip-Based Audio-Visual Speaker Embeddings with AV-HuBERT

This paper investigates self-supervised pre-training for audio-visual speaker representation learning where a visual stream showing the speaker's mouth area is used alongside speech as inputs. Our study focuses on the Audio-Visual Hidden…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-18 Bowen Shi, Abdelrahman Mohamed, Wei-Ning Hsu

Robust LLM-based Audio-Visual Speech Recognition with Sparse Modality Alignment and Visual Unit-Guided Refinement

Audio-Visual Speech Recognition (AVSR) integrates acoustic and visual information to enhance robustness in adverse acoustic conditions. Recent advances in Large Language Models (LLMs) have yielded competitive automatic speech recognition…

Sound · Computer Science 2026-03-05 Fei Su, Cancan Li, Juan Liu, Wei Ju, Hongbin Suo, Ming Li

Audio-Visual Speech Enhancement and Separation by Utilizing Multi-Modal Self-Supervised Embeddings

AV-HuBERT, a multi-modal self-supervised learning model, has been shown to be effective for categorical problems such as automatic speech recognition and lip-reading. This suggests that useful audio-visual speech representations can be…

Audio and Speech Processing · Electrical Eng. & Systems 2023-06-02 I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, Tassadaq Hussain, Mandar Gogate, Amir Hussain, Yu Tsao, Jen-Cheng Hou

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy…

Audio and Speech Processing · Electrical Eng. & Systems 2024-05-24 Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

Real-Time Audio-Visual Speech Enhancement Using Pre-trained Visual Representations

Speech enhancement in audio-only settings remains challenging, particularly in the presence of interfering speakers. This paper presents a simple yet effective real-time audio-visual speech enhancement (AVSE) system, RAVEN, which isolates…

Audio and Speech Processing · Electrical Eng. & Systems 2025-08-05 T. Aleksandra Ma, Sile Yin, Li-Chia Yang, Shuo Zhang

LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data

This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. Moving away from the resource-intensive trends…

Computer Vision and Pattern Recognition · Computer Science 2023-12-18 Hendrik Laux, Emil Mededovic, Ahmed Hallawa, Lukas Martin, Arne Peine, Anke Schmeink

A study on the impact of Self-Supervised Learning on automatic dysarthric speech assessment

Automating dysarthria assessments offers the opportunity to develop practical, low-cost tools that address the current limitations of manual and subjective assessments. Nonetheless, the small size of most dysarthria datasets makes it…

Computation and Language · Computer Science 2024-03-26 Xavier F. Cadet, Ranya Aloufi, Sara Ahmadi-Abhari, Hamed Haddadi

Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Audio-Visual Speech Recognition (AVSR) models have surpassed their audio-only counterparts in terms of performance. However, the interpretability of AVSR systems, particularly the role of the visual modality, remains under-explored. In this…

Audio and Speech Processing · Electrical Eng. & Systems 2026-05-06 Aristeidis Papadopoulos, Naomi Harte

Visual-Aware Speech Recognition for Noisy Scenarios

Humans have the ability to utilize visual cues, such as lip movements and visual scenes, to enhance auditory perception, particularly in noisy environments. However, current Automatic Speech Recognition (ASR) or Audio-Visual Speech…

Computation and Language · Computer Science 2025-04-11 Lakshmipathi Balaji, Karan Singla

Hourglass-AVSR: Down-Up Sampling-based Computational Efficiency Model for Audio-Visual Speech Recognition

Recently audio-visual speech recognition (AVSR), which better leverages video modality as additional information to extend automatic speech recognition (ASR), has shown promising results in complex acoustic environments. However, there is…

Sound · Computer Science 2023-12-15 Fan Yu, Haoxu Wang, Ziyang Ma, Shiliang Zhang

Leveraging Self-Supervised Audio-Visual Pretrained Models to Improve Vocoded Speech Intelligibility in Cochlear Implant Simulation

Individuals with hearing impairments face challenges in their ability to comprehend speech, particularly in noisy environments. The aim of this study is to explore the effectiveness of audio-visual speech enhancement (AVSE) in enhancing the…

Audio and Speech Processing · Electrical Eng. & Systems 2025-10-07 Richard Lee Lai, Jen-Cheng Hou, I-Chun Chern, Kuo-Hsuan Hung, Yi-Ting Chen, Mandar Gogate, Tughrul Arslan, Amir Hussain, Yu Tsao

Improving Noise Robust Audio-Visual Speech Recognition via Router-Gated Cross-Modal Feature Fusion

Robust audio-visual speech recognition (AVSR) in noisy environments remains challenging, as existing systems struggle to estimate audio reliability and dynamically adjust modality reliance. We propose router-gated cross-modal feature…

Computer Vision and Pattern Recognition · Computer Science 2025-08-27 DongHoon Lim, YoungChae Kim, Dong-Hyun Kim, Da-Hee Yang, Joon-Hyuk Chang

deHuBERT: Disentangling Noise in a Self-supervised Model for Robust Speech Recognition

Existing self-supervised pre-trained speech models have offered an effective way to leverage massive unannotated corpora to build good automatic speech recognition (ASR). However, many current models are trained on a clean corpus from a…

Sound · Computer Science 2023-03-01 Dianwen Ng, Ruixi Zhang, Jia Qi Yip, Zhao Yang, Jinjie Ni, Chong Zhang, Yukun Ma, Chongjia Ni, Eng Siong Chng, Bin Ma

Frustratingly Easy Noise-aware Training of Acoustic Models

Environmental noises and reverberation have a detrimental effect on the performance of automatic speech recognition (ASR) systems. Multi-condition training of neural network-based acoustic models is used to deal with this problem, but it…

Audio and Speech Processing · Electrical Eng. & Systems 2021-02-03 Desh Raj, Jesus Villalba, Daniel Povey, Sanjeev Khudanpur

Uncovering the Visual Contribution in Audio-Visual Speech Recognition

Audio-Visual Speech Recognition (AVSR) combines auditory and visual speech cues to enhance the accuracy and robustness of speech recognition systems. Recent advancements in AVSR have improved performance in noisy environments compared to…

Audio and Speech Processing · Electrical Eng. & Systems 2025-04-29 Zhaofeng Lin, Naomi Harte