Related papers: ConvBERT: Improving BERT with Span-based Dynamic C…

GroupBERT: Enhanced Transformer Architecture with Efficient Grouped Structures

Attention based language models have become a critical component in state-of-the-art natural language processing systems. However, these models have significant computational requirements, due to long training times, dense operations and…

Computation and Language · Computer Science 2021-06-11 Ivan Chelombiev, Daniel Justus, Douglas Orr, Anastasia Dietrich, Frithjof Gressmann, Alexandros Koliousis, Carlo Luschi

Improving BERT with Syntax-aware Local Attention

Pre-trained Transformer-based neural language models, such as BERT, have achieved remarkable results on varieties of NLP tasks. Recent works have shown that attention-based models can benefit from more focused attention over local regions.…

Computation and Language · Computer Science 2021-05-25 Zhongli Li, Qingyu Zhou, Chao Li, Ke Xu, Yunbo Cao

Blockwise Self-Attention for Long Document Understanding

We present BlockBERT, a lightweight and efficient BERT model for better modeling long-distance dependencies. Our model extends BERT by introducing sparse block structures into the attention matrix to reduce both memory consumption and…

Computation and Language · Computer Science 2020-11-03 Jiezhong Qiu, Hao Ma, Omer Levy, Scott Wen-tau Yih, Sinong Wang, Jie Tang

DecBERT: Enhancing the Language Understanding of BERT with Causal Attention Masks

Since 2017, Transformer-based models have played critical roles in various downstream Natural Language Processing tasks. However, a common limitation of the attention mechanism utilized in the Transformer Encoder is that it cannot automatically…

Computation and Language · Computer Science 2022-04-20 Ziyang Luo, Yadong Xi, Jing Ma, Zhiwei Yang, Xiaoxi Mao, Changjie Fan, Rongsheng Zhang

Pay Less Attention with Lightweight and Dynamic Convolutions

Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight…

Computation and Language · Computer Science 2019-02-26 Felix Wu, Angela Fan, Alexei Baevski, Yann N. Dauphin, Michael Auli

Dynamic Adaptive Attention and Supervised Contrastive Learning: A Novel Hybrid Framework for Text Sentiment Classification

The exponential growth of user-generated movie reviews on digital platforms has made accurate text sentiment classification a cornerstone task in natural language processing. Traditional models, including standard BERT and recurrent…

Computation and Language · Computer Science 2026-04-14 Qingyang Li

Does Self-Attention Need Separate Weights in Transformers?

The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent…

Computation and Language · Computer Science 2025-05-05 Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Ozmen Garibay, Niloofar Yousefi

Convolutions and Self-Attention: Re-interpreting Relative Positions in Pre-trained Language Models

In this paper, we detail the relationship between convolutions and self-attention in natural language tasks. We show that relative position embeddings in self-attention layers are equivalent to recently-proposed dynamic lightweight…

Computation and Language · Computer Science 2021-06-11 Tyler A. Chang, Yifan Xu, Weijian Xu, Zhuowen Tu

Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding

Pre-trained language models (PLM) have demonstrated their effectiveness for a broad range of information retrieval and natural language processing tasks. As the core part of PLM, multi-head self-attention is appealing for its ability to…

Computation and Language · Computer Science 2022-04-07 Shanshan Wang, Zhumin Chen, Zhaochun Ren, Huasheng Liang, Qiang Yan, Pengjie Ren

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language…

Computation and Language · Computer Science 2024-06-21 Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

SesameBERT: Attention for Anywhere

Fine-tuning with pre-trained models has achieved exceptional results for many language tasks. In this study, we focused on one such self-attention network model, namely BERT, which has performed well in terms of stacking layers across…

Computation and Language · Computer Science 2019-10-09 Ta-Chun Su, Hsiang-Chih Cheng

What Does BERT Look At? An Analysis of BERT's Attention

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused…

Computation and Language · Computer Science 2019-06-12 Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

Recent progress in pre-trained neural language models has significantly improved the performance of many natural language processing (NLP) tasks. In this paper we propose a new model architecture DeBERTa (Decoding-enhanced BERT with…

Computation and Language · Computer Science 2021-10-08 Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

DPBERT: Efficient Inference for BERT based on Dynamic Planning

Large-scale pre-trained language models such as BERT have contributed significantly to the development of NLP. However, those models require large computational resources, making it difficult to be applied to mobile devices where computing…

Computation and Language · Computer Science 2023-08-02 Weixin Wu, Hankz Hankui Zhuo

Attention Is (not) All You Need for Commonsense Reasoning

The recently introduced BERT model exhibits strong performance on several language understanding benchmarks. In this paper, we describe a simple re-implementation of BERT for commonsense reasoning. We show that the attentions produced by…

Computation and Language · Computer Science 2019-06-03 Tassilo Klein, Moin Nabi

Telling BERT's full story: from Local Attention to Global Aggregation

We take a deep look into the behavior of self-attention heads in the transformer architecture. In light of recent work discouraging the use of attention distributions for explaining a model's behavior, we show that attention distributions…

Machine Learning · Computer Science 2021-01-15 Damian Pascual, Gino Brunner, Roger Wattenhofer

EfficientBERT: Progressively Searching Multilayer Perceptron via Warm-up Knowledge Distillation

Pre-trained language models have shown remarkable results on various NLP tasks. Nevertheless, due to their bulky size and slow inference speed, it is hard to deploy them on edge devices. In this paper, we have a critical insight that…

Computation and Language · Computer Science 2021-09-17 Chenhe Dong, Guangrun Wang, Hang Xu, Jiefeng Peng, Xiaozhe Ren, Xiaodan Liang

LV-BERT: Exploiting Layer Variety for BERT

Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by…

Computation and Language · Computer Science 2021-06-28 Weihao Yu, Zihang Jiang, Fei Chen, Qibin Hou, Jiashi Feng

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer…

Computation and Language · Computer Science 2020-02-11 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

Do Attention Heads in BERT Track Syntactic Dependencies?

We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods: taking the maximum attention…

Computation and Language · Computer Science 2019-11-28 Phu Mon Htut, Jason Phang, Shikha Bordia, Samuel R. Bowman