Related papers: Enhancing CTC-Based Visual Speech Recognition

LiteVSR: Efficient Visual Speech Recognition by Learning from Speech Representations of Unlabeled Data

This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. Moving away from the resource-intensive trends…

Computer Vision and Pattern Recognition · Computer Science 2023-12-18 Hendrik Laux, Emil Mededovic, Ahmed Hallawa, Lukas Martin, Arne Peine, Anke Schmeink

Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping

Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully…

Sound · Computer Science 2023-08-14 Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Haithem Boussaid, Ebtessam Almazrouei, Merouane Debbah

ASR is all you need: cross-modal distillation for lip reading

The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a…

Computer Vision and Pattern Recognition · Computer Science 2020-04-01 Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy…

Audio and Speech Processing · Electrical Eng. & Systems 2024-05-24 Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

CR-CTC: Consistency regularization on CTC for improved speech recognition

Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-17 Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

DistillW2V2: A Small and Streaming Wav2vec 2.0 Based ASR Model

Wav2vec 2.0 (W2V2) has shown impressive performance in automatic speech recognition (ASR). However, the large model size and the non-streaming architecture make it hard to be used under low-resource or streaming scenarios. In this work, we…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-17 Yanzhe Fu, Yueteng Kang, Songjun Cao, Long Ma

Audio-Visual Efficient Conformer for Robust Speech Recognition

End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large scale hand-labeled datasets and sufficient computing resources made it possible to train…

Computer Vision and Pattern Recognition · Computer Science 2023-01-05 Maxime Burchi, Radu Timofte

Building Accurate Low Latency ASR for Streaming Voice Search

Automatic Speech Recognition (ASR) plays a crucial role in voice-based applications. For applications requiring real-time feedback like Voice Search, streaming capability becomes vital. While LSTM/RNN and CTC based ASR systems are commonly…

Sound · Computer Science 2023-05-31 Abhinav Goyal, Nikesh Garera

Streaming Audio-Visual Speech Recognition with Alignment Regularization

In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer…

Audio and Speech Processing · Electrical Eng. & Systems 2023-07-04 Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic

Listen, Look and Deliberate: Visual context-aware speech recognition using pre-trained text-video representations

In this study, we try to address the problem of leveraging visual signals to improve Automatic Speech Recognition (ASR), also known as visual context-aware ASR (VC-ASR). We explore novel VC-ASR approaches to leverage video and text…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-10 Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first…

Computer Vision and Pattern Recognition · Computer Science 2023-04-04 Xubo Liu, Egor Lakomkin, Konstantinos Vougioukas, Pingchuan Ma, Honglie Chen, Ruiming Xie, Morrie Doulaty, Niko Moritz, Jáchym Kolář, Stavros Petridis, Maja Pantic, Christian Fuegen

Audio-Visual Speech Recognition is Worth 32×32×8 Voxels

Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker's mouth. The use of the video signal requires…

Computer Vision and Pattern Recognition · Computer Science 2021-09-21 Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

VALLR: Visual ASR Language Model for Lip Reading

Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Marshall Thomas, Edward Fish, Richard Bowden

Application of Knowledge Distillation to Multi-task Speech Representation Learning

Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When they are combined with downstream tasks such as keyword spotting and speaker…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-22 Mine Kerpicci, Van Nguyen, Shuhua Zhang, Erik Visser

The NPU-ASLP-LiAuto System Description for Visual Speech Recognition in CNVSRC 2023

This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-01 He Wang, Pengcheng Guo, Wei Chen, Pan Zhou, Lei Xie

SVTRv2: CTC Beats Encoder-Decoder Models in Scene Text Recognition

Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned…

Computer Vision and Pattern Recognition · Computer Science 2025-07-16 Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been…

Computer Vision and Pattern Recognition · Computer Science 2023-06-29 Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition

Visual Speech Recognition (VSR) aims to recognize corresponding text by analyzing visual information from lip movements. Due to the high variability and weak information of lip movements, VSR tasks require effectively utilizing any…

Sound · Computer Science 2024-10-23 Zehua Liu, Xiaolou Li, Chen Chen, Li Guo, Lantian Li, Dong Wang

Distilling a Pretrained Language Model to a Multilingual ASR Model

Multilingual speech data often suffer from long-tailed language distribution, resulting in performance degradation. However, multilingual text data is much easier to obtain, yielding a more useful general language model. Hence, we are…

Computation and Language · Computer Science 2022-06-28 Kwanghee Choi, Hyung-Min Park

Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation

This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-19 Minsu Kim, Jeong Hun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro