Related papers: Answer Fast: Accelerating BERT on the Tensor Strea…

Optimizing Inference Performance of Transformers on CPUs

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc. While enormous…

Computation and Language · Computer Science 2021-02-23 Dave Dice, Alex Kogan

Demystifying BERT: Implications for Accelerator Design

Transfer learning in natural language processing (NLP), as realized using models like BERT (Bi-directional Encoder Representation from Transformer), has significantly improved language representation with models that can tackle challenging…

Hardware Architecture · Computer Science 2021-04-20 Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair

Finding Fast Transformers: One-Shot Neural Architecture Search by Component Composition

Transformer-based models have achieved state-of-the-art results in many tasks in natural language processing. However, such models are usually slow at inference time, making deployment difficult. In this paper, we develop an efficient…

Machine Learning · Computer Science 2020-08-18 Henry Tsai, Jayden Ooi, Chun-Sung Ferng, Hyung Won Chung, Jason Riesa

Teaching BERT to Wait: Balancing Accuracy and Latency for Streaming Disfluency Detection

In modern interactive speech-based systems, speech is consumed and transcribed incrementally prior to having disfluencies removed. This post-processing step is crucial for producing clean transcripts and high performance on downstream tasks…

Computation and Language · Computer Science 2022-05-03 Angelica Chen, Vicky Zayats, Daniel D. Walker, Dirk Padfield

Advancements in Natural Language Processing: Exploring Transformer-Based Architectures for Text Understanding

Natural Language Processing (NLP) has witnessed a transformative leap with the advent of transformer-based architectures, which have significantly enhanced the ability of machines to understand and generate human-like text. This paper…

Computation and Language · Computer Science 2025-03-27 Tianhao Wu, Yu Wang, Ngoc Quach

Improving Fast-slow Encoder based Transducer with Streaming Deliberation

This paper introduces a fast-slow encoder based transducer with streaming deliberation for end-to-end automatic speech recognition. We aim to improve the recognition accuracy of the fast-slow encoder based transducer while keeping its…

Audio and Speech Processing · Electrical Eng. & Systems 2022-12-16 Ke Li, Jay Mahadeokar, Jinxi Guo, Yangyang Shi, Gil Keren, Ozlem Kalinli, Michael L. Seltzer, Duc Le

Hierarchical Transformers for Long Document Classification

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its…

Computation and Language · Computer Science 2019-10-25 Raghavendra Pappagari, Piotr Żelasko, Jesús Villalba, Yishay Carmiel, Najim Dehak

Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset

Recently, Transformer based end-to-end models have achieved great success in many areas including speech recognition. However, compared to LSTM models, the heavy computational cost of the Transformer during inference is a key issue to…

Computation and Language · Computer Science 2021-03-02 Xie Chen, Yu Wu, Zhenghao Wang, Shujie Liu, Jinyu Li

AdapLeR: Speeding up Inference by Adaptive Length Reduction

Pre-trained language models have shown stellar performance in various downstream tasks. But, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a…

Computation and Language · Computer Science 2022-03-18 Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

Non-autoregressive Transformer-based End-to-end ASR using BERT

Transformer-based models have led to significant innovation in classical and practical subjects as varied as speech processing, natural language processing, and computer vision. On top of the Transformer, attention-based end-to-end…

Computation and Language · Computer Science 2022-05-19 Fu-Hao Yu, Kuan-Yu Chen

LightSeq2: Accelerated Training for Transformer-based Models on GPUs

Transformer-based neural models are used in many AI applications. Training these models is expensive, as it takes huge GPU resources and long duration. It is challenging because typical data like sentences have variable lengths, and…

Computation and Language · Computer Science 2022-06-17 Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, Lei Li

LightSeq: A High Performance Inference Library for Transformers

Transformer, BERT and their variants have achieved great success in natural language processing. Since Transformer models are huge in size, serving these models is a challenge for real industrial applications. In this paper, we propose…

Mathematical Software · Computer Science 2021-04-23 Xiaohui Wang, Ying Xiong, Yang Wei, Mingxuan Wang, Lei Li

FastBERT: a Self-distilling BERT with Adaptive Inference Time

Pre-trained language models like BERT have proven to be highly performant. However, they are often computationally expensive in many practical scenarios, for such heavy models can hardly be readily implemented with limited resources. To…

Computation and Language · Computer Science 2020-04-30 Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, Qi Ju

Exponentially Faster Language Modelling

Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with…

Computation and Language · Computer Science 2023-11-22 Peter Belcak, Roger Wattenhofer

TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference

Existing pre-trained language models (PLMs) are often computationally expensive in inference, making them impractical in various resource-limited real-world applications. To address this issue, we propose a dynamic token reduction approach…

Computation and Language · Computer Science 2021-05-26 Deming Ye, Yankai Lin, Yufei Huang, Maosong Sun

Making Neural Machine Reading Comprehension Faster

This study aims at solving the Machine Reading Comprehension problem where questions have to be answered given a context passage. The challenge is to develop a computationally faster model which will have improved inference time. State of…

Computation and Language · Computer Science 2019-04-02 Debajyoti Chatterjee

Streaming Simultaneous Speech Translation with Augmented Memory Transformer

Transformer-based models have achieved state-of-the-art performance on speech translation tasks. However, the model architecture is not efficient enough for streaming scenarios since self-attention is computed over an entire input sequence…

Computation and Language · Computer Science 2020-11-03 Xutai Ma, Yongqiang Wang, Mohammad Javad Dousti, Philipp Koehn, Juan Pino

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have…

Computation and Language · Computer Science 2024-12-20 Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli

A Survey of Techniques for Optimizing Transformer Inference

Recent years have seen a phenomenal rise in performance and applications of transformer neural networks. The family of transformer networks, including Bidirectional Encoder Representations from Transformer (BERT), Generative Pretrained…

Machine Learning · Computer Science 2023-07-18 Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, Arun K. Somani

On the Transformer Growth for Progressive BERT Training

Due to the excessive cost of large-scale language model pre-training, considerable efforts have been made to train BERT progressively -- start from an inferior but low-cost model and gradually grow the model to increase the computational…

Computation and Language · Computer Science 2021-07-13 Xiaotao Gu, Liyuan Liu, Hongkun Yu, Jing Li, Chen Chen, Jiawei Han