Related papers: Multi-Convformer: Extending Conformer with Multipl…

Conformer: Convolution-augmented Transformer for Speech Recognition

Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-19 Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang

E-Branchformer: Branchformer with Enhanced merging for speech recognition

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR).…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-18 Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J. Han, Shinji Watanabe

Conformer LLMs -- Convolution Augmented Large Language Models

This work builds together two popular blocks of neural architecture, namely convolutional layers and Transformers, for large language models (LLMs). Non-causal conformers are used ubiquitously in automatic speech recognition. This work aims…

Computation and Language · Computer Science 2023-07-04 Prateek Verma

MFA-Conformer: Multi-scale Feature Aggregation Conformer for Automatic Speaker Verification

In this paper, we present Multi-scale Feature Aggregation Conformer (MFA-Conformer), an easy-to-implement, simple but effective backbone for automatic speaker verification based on the Convolution-augmented Transformer (Conformer). The…

Sound · Computer Science 2022-11-14 Yang Zhang, Zhiqiang Lv, Haibin Wu, Shanshan Zhang, Pengfei Hu, Zhiyong Wu, Hung-yi Lee, Helen Meng

Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding

Conformer has proven to be effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose a more flexible,…

Computation and Language · Computer Science 2022-07-08 Yifan Peng, Siddharth Dalmia, Ian Lane, Shinji Watanabe

Towards A Unified Conformer Structure: from ASR to ASV Task

Transformer has achieved extraordinary performance in Natural Language Processing and Computer Vision tasks thanks to its powerful self-attention mechanism, and its variant Conformer has become a state-of-the-art architecture in the field…

Audio and Speech Processing · Electrical Eng. & Systems 2023-01-18 Dexin Liao, Tao Jiang, Feng Wang, Lin Li, Qingyang Hong

DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition

Convolutional neural networks (CNN) have improved speech recognition performance greatly by exploiting localized time-frequency patterns. But these patterns are assumed to appear in symmetric and rigid kernels by the conventional CNN…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-19 Jiamin Xie, John H. L. Hansen

HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are…

Computation and Language · Computer Science 2023-05-30 Florian Mai, Juan Zuluaga-Gomez, Titouan Parcollet, Petr Motlicek

Attention or Convolution: Transformer Encoders in Audio Language Models for Inference Efficiency

In this paper, we show that a simple self-supervised pre-trained audio model can achieve comparable inference efficiency to more complicated pre-trained models with speech transformer encoders. These speech transformers rely on mixing…

Sound · Computer Science 2024-02-09 Sungho Jeon, Ching-Feng Yeh, Hakan Inan, Wei-Ning Hsu, Rashi Rungta, Yashar Mehdad, Daniel Bikel

End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis

We present an end-to-end multichannel speaker-attributed automatic speech recognition (MC-SA-ASR) system that combines a Conformer-based encoder with multi-frame cross-channel attention and a speaker-attributed Transformer-based decoder. To…

Computation and Language · Computer Science 2023-10-17 Can Cui, Imran Ahamad Sheikh, Mostafa Sadeghi, Emmanuel Vincent

Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers

This paper addresses end-to-end automatic speech recognition (ASR) for long audio recordings such as lecture and conversational speeches. Most end-to-end ASR models are designed to recognize independent utterances, but contextual…

Computation and Language · Computer Science 2021-04-20 Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux

Deep Sparse Conformer for Speech Recognition

Conformer has achieved impressive results in Automatic Speech Recognition (ASR) by leveraging transformer's capturing of content-based global interactions and convolutional neural network's exploiting of local features. In Conformer, two…

Computation and Language · Computer Science 2022-09-02 Xianchao Wu

Multiresolution and Multimodal Speech Recognition with Transformers

This paper presents an audio visual automatic speech recognition (AV-ASR) system using a Transformer-based architecture. We particularly focus on the scene context provided by the visual information, to ground the ASR. We extract…

Audio and Speech Processing · Electrical Eng. & Systems 2020-05-01 Georgios Paraskevopoulos, Srinivas Parthasarathy, Aparna Khare, Shiva Sundaram

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

Conformer, a convolution-augmented Transformer variant, has become the de facto encoder architecture for speech processing due to its superior performance in various tasks, including automatic speech recognition (ASR), speech translation…

Computation and Language · Computer Science 2023-05-19 Yifan Peng, Kwangyoun Kim, Felix Wu, Brian Yan, Siddhant Arora, William Chen, Jiyang Tang, Suwon Shon, Prashant Sridhar, Shinji Watanabe

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series…

Audio and Speech Processing · Electrical Eng. & Systems 2022-10-18 Sehoon Kim, Amir Gholami, Albert Shaw, Nicholas Lee, Karttikeya Mangalam, Jitendra Malik, Michael W. Mahoney, Kurt Keutzer

Adaptive Convolution for CNN-based Speech Enhancement Models

Deep learning-based speech enhancement methods have significantly improved speech quality and intelligibility. Convolutional neural networks (CNNs) have been proven to be essential components of many high-performance models. In this paper,…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-11 Dahan Wang, Xiaobin Rong, Shiruo Sun, Yuxiang Hu, Changbao Zhu, Jing Lu

Recent Developments on ESPnet Toolkit Boosted by Conformer

In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involves a recently proposed architecture called Conformer, Convolution-augmented Transformer. This paper shows the results for a…

Audio and Speech Processing · Electrical Eng. & Systems 2020-10-30 Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang

Nextformer: A ConvNeXt Augmented Conformer For End-To-End Speech Recognition

Conformer models have achieved state-of-the-art (SOTA) results in end-to-end speech recognition. However, Conformer mainly focuses on temporal modeling while paying less attention to the time-frequency properties of speech features. In this paper we…

Audio and Speech Processing · Electrical Eng. & Systems 2022-07-01 Yongjun Jiang, Jian Yu, Wenwen Yang, Bihong Zhang, Yanfeng Wang

Improving Mandarin Speech Recogntion with Block-augmented Transformer

Recently Convolution-augmented Transformer (Conformer) has shown promising results in Automatic Speech Recognition (ASR), outperforming the previous best published Transformer Transducer. In this work, we believe that the output information…

Computation and Language · Computer Science 2022-12-02 Xiaoming Ren, Huifeng Zhu, Liuwei Wei, Minghui Wu, Jie Hao

CMGAN: Conformer-based Metric GAN for Speech Enhancement

Recently, convolution-augmented transformer (Conformer) has achieved promising performance in automatic speech recognition (ASR) and time-domain speech enhancement (SE), as it can capture both local and global dependencies in the speech…

Sound · Computer Science 2024-05-07 Ruizhe Cao, Sherif Abdulatif, Bin Yang