Related papers: Audio-Visual Efficient Conformer for Robust Speech…

Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer

Humans are adept at leveraging visual cues from lip movements for recognizing speech in adverse listening conditions. Audio-Visual Speech Recognition (AVSR) models follow a similar approach to achieve robust speech recognition in noisy…

Audio and Speech Processing · Electrical Eng. & Systems · 2024-05-24 · Maxime Burchi, Krishna C. Puvvada, Jagadeesh Balam, Boris Ginsburg, Radu Timofte

Efficient conformer: Progressive downsampling and grouped attention for automatic speech recognition

The recently proposed Conformer architecture has shown state-of-the-art performances in Automatic Speech Recognition by combining convolution with attention to model both local and global dependencies. In this paper, we study how to reduce…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-09-09 · Maxime Burchi, Valentin Vielzeuf

Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End Speech Recognition

Optimization of modern ASR architectures is among the highest priority tasks since it saves many computational resources for model training and inference. The work proposes a new Uconv-Conformer architecture based on the standard Conformer…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-03-14 · Andrei Andrusenko, Rauf Nasretdinov, Aleksei Romanenko

End-to-end Audio-visual Speech Recognition with Conformers

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner. In particular, the audio and visual encoders learn to extract…

Computer Vision and Pattern Recognition · Computer Science · 2021-02-15 · Pingchuan Ma, Stavros Petridis, Maja Pantic

An improved hybrid CTC-Attention model for speech recognition

Recently, end-to-end speech recognition with a hybrid model consisting of the connectionist temporal classification (CTC) and the attention encoder-decoder achieved state-of-the-art results. In this paper, we propose a novel CTC decoder…

Sound · Computer Science · 2018-11-02 · Zhe Yuan, Zhuoran Lyu, Jiwei Li, Xi Zhou

Streaming Audio-Visual Speech Recognition with Alignment Regularization

In this work, we propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and the visual encoder neural networks are both based on the conformer…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-07-04 · Pingchuan Ma, Niko Moritz, Stavros Petridis, Christian Fuegen, Maja Pantic

Robust end-to-end deep audiovisual speech recognition

Speech is one of the most effective ways of communication among humans. Even though audio is the most common way of transmitting speech, very important information can be found in other modalities, such as vision. Vision is particularly…

Computation and Language · Computer Science · 2016-11-22 · Ramon Sanabria, Florian Metze, Fernando De La Torre

Advances in Joint CTC-Attention based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM

We present a state-of-the-art end-to-end Automatic Speech Recognition (ASR) model. We learn to listen and write characters with a joint Connectionist Temporal Classification (CTC) and attention-based encoder-decoder network. The encoder is…

Computation and Language · Computer Science · 2017-06-12 · Takaaki Hori, Shinji Watanabe, Yu Zhang, William Chan

Improved Mask-CTC for Non-Autoregressive End-to-End ASR

For real-world deployment of automatic speech recognition (ASR), the system is desired to be capable of fast inference while relieving the requirement of computational resources. The recently proposed end-to-end ASR system based on…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-02-17 · Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi

Fusing information streams in end-to-end audio-visual speech recognition

End-to-end acoustic speech recognition has quickly gained widespread popularity and shows promising results in many studies. Specifically, the joint transformer/CTC model provides very good performance in many tasks. However, under noisy and…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-04-20 · Wentao Yu, Steffen Zeiler, Dorothea Kolossa

Towards A Unified Conformer Structure: from ASR to ASV Task

Transformer has achieved extraordinary performance in Natural Language Processing and Computer Vision tasks thanks to its powerful self-attention mechanism, and its variant Conformer has become a state-of-the-art architecture in the field…

Audio and Speech Processing · Electrical Eng. & Systems · 2023-01-18 · Dexin Liao, Tao Jiang, Feng Wang, Lin Li, Qingyang Hong

A Conformer Based Acoustic Model for Robust Automatic Speech Recognition

This study addresses robust automatic speech recognition (ASR) by introducing a Conformer-based acoustic model. The proposed model builds on the wide residual bi-directional long short-term memory network (WRBN) with utterance-wise dropout…

Sound · Computer Science · 2022-10-21 · Yufeng Yang, Peidong Wang, DeLiang Wang

Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition

The recent emergence of the joint CTC-Attention model shows significant improvement in automatic speech recognition (ASR). The improvement largely lies in the modeling of linguistic information by the decoder. The decoder joint-optimized with an…

Computation and Language · Computer Science · 2022-10-27 · Xulong Zhang, Jianzong Wang, Ning Cheng, Mengyuan Zhao, Zhiyong Zhang, Jing Xiao

Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers

Automatic speech recognition systems have been largely improved in the past few decades and current systems are mainly hybrid-based and end-to-end-based. The recently proposed CTC-CRF framework inherits the data-efficiency of the hybrid…

Audio and Speech Processing · Electrical Eng. & Systems · 2021-07-09 · Huahuan Zheng, Wenjie Peng, Zhijian Ou, Jinsong Zhang

Joint CTC-Attention based End-to-End Speech Recognition using Multi-task Learning

Recently, there has been an increasing interest in end-to-end speech recognition that directly transcribes speech to text without any predefined alignments. One approach is the attention-based encoder-decoder framework that learns a mapping…

Computation and Language · Computer Science · 2017-02-02 · Suyoun Kim, Takaaki Hori, Shinji Watanabe

Multi-encoder multi-resolution framework for end-to-end speech recognition

Attention-based methods and Connectionist Temporal Classification (CTC) networks have been promising research directions for end-to-end Automatic Speech Recognition (ASR). The joint CTC/Attention model has achieved great success by utilizing…

Computation and Language · Computer Science · 2018-11-13 · Ruizhi Li, Xiaofei Wang, Sri Harish Mallidi, Takaaki Hori, Shinji Watanabe, Hynek Hermansky

A Conformer-based Waveform-domain Neural Acoustic Echo Canceller Optimized for ASR Accuracy

Acoustic Echo Cancellation (AEC) is essential for accurate recognition of queries spoken to a smart speaker that is playing out audio. Previous work has shown that a neural AEC model operating on log-mel spectral features (denoted "logmel"…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-05-10 · Sankaran Panchapagesan, Arun Narayanan, Turaj Zakizadeh Shabestary, Shuai Shao, Nathan Howard, Alex Park, James Walker, Alexander Gruenstein

Improving non-autoregressive end-to-end speech recognition with pre-trained acoustic and language models

While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure becomes a bottleneck for speeding up the decoding process. For real-world deployment, ASR systems…

Audio and Speech Processing · Electrical Eng. & Systems · 2022-01-27 · Keqi Deng, Zehui Yang, Shinji Watanabe, Yosuke Higuchi, Gaofeng Cheng, Pengyuan Zhang

Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition

The Conformer model is an excellent architecture for speech recognition modeling that effectively utilizes the hybrid losses of connectionist temporal classification (CTC) and attention to train model parameters. To improve the decoding…

Sound · Computer Science · 2022-04-11 · Nick J. C. Wang, Zongfeng Quan, Shaojun Wang, Jing Xiao

Conformer-Based Speech Recognition On Extreme Edge-Computing Devices

With increasingly more powerful compute capabilities and resources in today's devices, traditionally compute-intensive automatic speech recognition (ASR) has been moving from the cloud to devices to better protect user privacy. However, it…

Machine Learning · Computer Science · 2024-05-15 · Mingbin Xu, Alex Jin, Sicheng Wang, Mu Su, Tim Ng, Henry Mason, Shiyi Han, Zhihong Lei, Yaqiao Deng, Zhen Huang, Mahesh Krishnamoorthy