Related papers: Enhancing CTC-Based Visual Speech Recognition

200 papers

This paper proposes a novel, resource-efficient approach to Visual Speech Recognition (VSR) leveraging speech representations produced by any trained Automatic Speech Recognition (ASR) model. Moving away from the resource-intensive trends…

Computer Vision and Pattern Recognition · Computer Science 2023-12-18 Hendrik Laux, Emil Mededovic, Ahmed Hallawa, Lukas Martin, Arne Peine, Anke Schmeink

Visual Speech Recognition (VSR) differs from common perception tasks in that it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully…

The goal of this work is to train strong models for visual speech recognition without requiring human annotated ground truth data. We achieve this by distilling from an Automatic Speech Recognition (ASR) model that has been trained on a…

Computer Vision and Pattern Recognition · Computer Science 2020-04-01 Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman

Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy…

Audio and Speech Processing · Electrical Eng. & Systems 2024-05-24 Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR), renowned for its simplicity and computational efficiency. However, it often falls short in recognition performance. In this work, we…

Audio and Speech Processing · Electrical Eng. & Systems 2025-02-17 Zengwei Yao, Wei Kang, Xiaoyu Yang, Fangjun Kuang, Liyong Guo, Han Zhu, Zengrui Jin, Zhaoqing Li, Long Lin, Daniel Povey

Wav2vec 2.0 (W2V2) has shown impressive performance in automatic speech recognition (ASR). However, the large model size and the non-streaming architecture make it hard to be used under low-resource or streaming scenarios. In this work, we…

Audio and Speech Processing · Electrical Eng. & Systems 2023-03-17 Yanzhe Fu, Yueteng Kang, Songjun Cao, Long Ma

End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large scale hand-labeled datasets and sufficient computing resources made it possible to train…

Computer Vision and Pattern Recognition · Computer Science 2023-01-05 Maxime Burchi, Radu Timofte

Automatic Speech Recognition (ASR) plays a crucial role in voice-based applications. For applications requiring real-time feedback like Voice Search, streaming capability becomes vital. While LSTM/RNN and CTC based ASR systems are commonly…

Sound · Computer Science 2023-05-31 Abhinav Goyal, Nikesh Garera

In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer…

Audio and Speech Processing · Electrical Eng. & Systems 2023-07-04 Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic

In this study, we try to address the problem of leveraging visual signals to improve Automatic Speech Recognition (ASR), also known as visual context-aware ASR (VC-ASR). We explore novel VC-ASR approaches to leverage video and text…

Audio and Speech Processing · Electrical Eng. & Systems 2020-11-10 Shahram Ghorbani, Yashesh Gaur, Yu Shi, Jinyu Li

Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first…

Audio-visual automatic speech recognition (AV-ASR) introduces the video modality into the speech recognition process, often by relying on information conveyed by the motion of the speaker's mouth. The use of the video signal requires…

Computer Vision and Pattern Recognition · Computer Science 2021-09-21 Dmitriy Serdyuk, Otavio Braga, Olivier Siohan

Lip Reading, or Visual Automatic Speech Recognition (V-ASR), is a complex task requiring the interpretation of spoken language exclusively from visual cues, primarily lip movements and facial expressions. This task is especially challenging…

Computer Vision and Pattern Recognition · Computer Science 2026-01-06 Marshall Thomas, Edward Fish, Richard Bowden

Model architectures such as wav2vec 2.0 and HuBERT have been proposed to learn speech representations from audio waveforms in a self-supervised manner. When they are combined with downstream tasks such as keyword spotting and speaker…

Audio and Speech Processing · Electrical Eng. & Systems 2023-05-22 Mine Kerpicci, Van Nguyen, Shuhua Zhang, Erik Visser

This paper delineates the visual speech recognition (VSR) system introduced by the NPU-ASLP-LiAuto (Team 237) in the first Chinese Continuous Visual Speech Recognition Challenge (CNVSRC) 2023, engaging in the fixed and open tracks of…

Audio and Speech Processing · Electrical Eng. & Systems 2024-03-01 He Wang, Pengcheng Guo, Wei Chen, Pan Zhou, Lei Xie

Connectionist temporal classification (CTC)-based scene text recognition (STR) methods, e.g., SVTR, are widely employed in OCR applications, mainly due to their simple architecture, which only contains a visual model and a CTC-aligned…

Computer Vision and Pattern Recognition · Computer Science 2025-07-16 Yongkun Du, Zhineng Chen, Hongtao Xie, Caiyan Jia, Yu-Gang Jiang

Audio-visual speech recognition has received a lot of attention due to its robustness against acoustic noise. Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been…

Computer Vision and Pattern Recognition · Computer Science 2023-06-29 Pingchuan Ma, Alexandros Haliassos, Adriana Fernandez-Lopez, Honglie Chen, Stavros Petridis, Maja Pantic

Visual Speech Recognition (VSR) aims to recognize corresponding text by analyzing visual information from lip movements. Due to the high variability and weak information of lip movements, VSR tasks require effectively utilizing any…

Sound · Computer Science 2024-10-23 Zehua Liu, Xiaolou Li, Chen Chen, Li Guo, Lantian Li, Dong Wang

Multilingual speech data often suffer from long-tailed language distribution, resulting in performance degradation. However, multilingual text data is much easier to obtain, yielding a more useful general language model. Hence, we are…

Computation and Language · Computer Science 2022-06-28 Kwanghee Choi, Hyung-Min Park

This paper explores sentence-level multilingual Visual Speech Recognition (VSR) that can recognize different languages with a single trained model. As the massive multilingual modeling of visual data requires huge computational costs, we…

Audio and Speech Processing · Electrical Eng. & Systems 2024-07-19 Minsu Kim, Jeong Hun Yeo, Se Jin Park, Hyeongseop Rha, Yong Man Ro