Related papers: LightSeq: A High Performance Inference Library for…

LightSeq2: Accelerated Training for Transformer-based Models on GPUs

Transformer-based neural models are used in many AI applications. Training these models is expensive, as it takes huge GPU resources and a long time. It is challenging because typical data like sentences have variable lengths, and…

Computation and Language · Computer Science · 2022-06-17 · Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, Lei Li

FastSeq: Make Sequence Generation Faster

Transformer-based models have made tremendous impacts in natural language generation. However, the inference speed is a bottleneck due to the large model size and intensive computing involved in the auto-regressive decoding process. We develop…

Computation and Language · Computer Science · 2021-07-14 · Yu Yan, Fei Hu, Jiusheng Chen, Nikhil Bhendawade, Ting Ye, Yeyun Gong, Nan Duan, Desheng Cui, Bingyu Chi, Ruofei Zhang

Optimizing Inference Performance of Transformers on CPUs

The Transformer architecture revolutionized the field of natural language processing (NLP). Transformers-based models (e.g., BERT) power many important Web services, such as search, translation, question-answering, etc. While enormous…

Computation and Language · Computer Science · 2021-02-23 · Dave Dice, Alex Kogan

HetSeq: Distributed GPU Training on Heterogeneous Infrastructure

Modern deep learning systems like PyTorch and Tensorflow are able to train enormous models with billions (or trillions) of parameters on a distributed infrastructure. These systems require that the internal nodes have the same memory…

Distributed, Parallel, and Cluster Computing · Computer Science · 2020-10-01 · Yifan Ding, Nicholas Botzer, Tim Weninger

fairseq: A Fast, Extensible Toolkit for Sequence Modeling

fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and…

Computation and Language · Computer Science · 2019-04-03 · Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, Michael Auli

Easy and Efficient Transformer: Scalable Inference Solution For large NLP model

Recently, large-scale transformer-based models have been proven to be effective over various tasks across many domains. Nevertheless, applying them in industrial production requires tedious and heavy work to reduce inference costs. To fill…

Computation and Language · Computer Science · 2022-05-25 · Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao

Answer Fast: Accelerating BERT on the Tensor Streaming Processor

Transformers have become a predominant machine learning workload; they are not only the de facto standard for natural language processing tasks, but they are also being deployed in other domains such as vision and speech recognition. Many…

Machine Learning · Computer Science · 2022-06-23 · Ibrahim Ahmed, Sahil Parmar, Matthew Boyd, Michael Beidler, Kris Kang, Bill Liu, Kyle Roach, John Kim, Dennis Abts

Efficient Inference of Sub-Item Id-based Sequential Recommendation Models with Millions of Items

Transformer-based recommender systems, such as BERT4Rec or SASRec, achieve state-of-the-art results in sequential recommendation. However, it is challenging to use these models in production environments with catalogues of millions of…

Information Retrieval · Computer Science · 2024-08-20 · Aleksandr V. Petrov, Craig Macdonald, Nicola Tonellotto

FlashEVA: Accelerating LLM inference via Efficient Attention

Transformer models have revolutionized natural language processing, achieving state-of-the-art performance and demonstrating remarkable scalability. However, their memory demands, particularly due to maintaining full context in memory, pose…

Computation and Language · Computer Science · 2025-11-04 · Juan Gabriel Kostelec, Qinghai Guo

FastBERT: a Self-distilling BERT with Adaptive Inference Time

Pre-trained language models like BERT have proven to be highly performant. However, they are often computationally expensive in many practical scenarios, for such heavy models can hardly be readily implemented with limited resources. To…

Computation and Language · Computer Science · 2020-04-30 · Weijie Liu, Peng Zhou, Zhe Zhao, Zhiruo Wang, Haotang Deng, Qi Ju

AdapLeR: Speeding up Inference by Adaptive Length Reduction

Pre-trained language models have shown stellar performance in various downstream tasks. But, this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a…

Computation and Language · Computer Science · 2022-03-18 · Ali Modarressi, Hosein Mohebbi, Mohammad Taher Pilehvar

Transformer on a Diet

Transformer has been widely used thanks to its ability to capture sequence information in an efficient way. However, recent developments, such as BERT and GPT-2, deliver only heavy architectures with a focus on effectiveness. In this paper,…

Computation and Language · Computer Science · 2020-02-17 · Chenguang Wang, Zihao Ye, Aston Zhang, Zheng Zhang, Alexander J. Smola

Efficient Machine Translation with a BiLSTM-Attention Approach

With the rapid development of Natural Language Processing (NLP) technology, the accuracy and efficiency of machine translation have become hot topics of research. This paper proposes a novel Seq2Seq model aimed at improving translation…

Computation and Language · Computer Science · 2024-11-01 · Yuxu Wu, Yiren Xing

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer…

Computation and Language · Computer Science · 2020-02-11 · Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs

Transformers have become keystone models in natural language processing over the past decade. They have achieved great popularity in deep learning applications, but the increasing sizes of the parameter spaces required by transformer models…

Machine Learning · Computer Science · 2023-02-21 · Yujia Zhai, Chengquan Jiang, Leyuan Wang, Xiaoying Jia, Shang Zhang, Zizhong Chen, Xin Liu, Yibo Zhu

Compressing Large-Scale Transformer-Based Models: A Case Study on BERT

Pre-trained Transformer-based models have achieved state-of-the-art performance for various Natural Language Processing (NLP) tasks. However, these models often have billions of parameters, and, thus, are too resource-hungry and…

Machine Learning · Computer Science · 2021-09-29 · Prakhar Ganesh, Yao Chen, Xin Lou, Mohammad Ali Khan, Yin Yang, Hassan Sajjad, Preslav Nakov, Deming Chen, Marianne Winslett

schuBERT: Optimizing Elements of BERT

Transformers (Vaswani et al., 2017) have gradually become a key component for many state-of-the-art natural language representation models. A recent Transformer-based model, BERT (Devlin et al., 2018), achieved state-of-the-art…

Computation and Language · Computer Science · 2020-05-15 · Ashish Khetan, Zohar Karnin

Finding Fast Transformers: One-Shot Neural Architecture Search by Component Composition

Transformer-based models have achieved state-of-the-art results in many tasks in natural language processing. However, such models are usually slow at inference time, making deployment difficult. In this paper, we develop an efficient…

Machine Learning · Computer Science · 2020-08-18 · Henry Tsai, Jayden Ooi, Chun-Sung Ferng, Hyung Won Chung, Jason Riesa

Mixed-Precision Training for NLP and Speech Recognition with OpenSeq2Seq

We present OpenSeq2Seq - a TensorFlow-based toolkit for training sequence-to-sequence models that features distributed and mixed-precision training. Benchmarks on machine translation and speech recognition tasks show that models built using…

Computation and Language · Computer Science · 2018-11-22 · Oleksii Kuchaiev, Boris Ginsburg, Igor Gitman, Vitaly Lavrukhin, Jason Li, Huyen Nguyen, Carl Case, Paulius Micikevicius

SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

While transformer-based Large Language Models (LLMs) theoretically support massive context windows, they suffer from severe performance degradation when processing long numerical sequences. We attribute this failure to the attention…

Computation and Language · Computer Science · 2026-04-10 · Jie Sun, Yu Liu, Lu Han, Qiwen Deng, Xiang Shu, Yang Xiao, Xingyu Lu, Jun Zhou, Pengfei Liu, Lintao Ma, Jiancan Wu, Xiang Wang