Related papers: Improving BERT with Self-Supervised Attention

Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

Fine-tuning pre-trained language models like BERT has become an effective way in NLP and yields state-of-the-art results on many downstream tasks. Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure,…

Computation and Language · Computer Science 2020-02-25 Yige Xu, Xipeng Qiu, Ligao Zhou, Xuanjing Huang

Optimizing small BERTs trained for German NER

Currently, the most widespread neural network architecture for training language models is the so called BERT which led to improvements in various Natural Language Processing (NLP) tasks. In general, the larger the number of parameters in a…

Computation and Language · Computer Science 2021-11-02 Jochen Zöllner, Konrad Sperfeld, Christoph Wick, Roger Labahn

Improving BERT with Syntax-aware Local Attention

Pre-trained Transformer-based neural language models, such as BERT, have achieved remarkable results on varieties of NLP tasks. Recent works have shown that attention-based models can benefit from more focused attention over local regions.…

Computation and Language · Computer Science 2021-05-25 Zhongli Li, Qingyu Zhou, Chao Li, Ke Xu, Yunbo Cao

ESIE-BERT: Enriching Sub-words Information Explicitly with BERT for Joint Intent Classification and Slot Filling

Natural language understanding (NLU) has two core tasks: intent classification and slot filling. The success of pre-training language models resulted in a significant breakthrough in the two tasks. One of the promising solutions called BERT…

Computation and Language · Computer Science 2023-02-03 Yu Guo, Zhilong Xie, Xingyan Chen, Huangen Chen, Leilei Wang, Huaming Du, Shaopeng Wei, Yu Zhao, Qing Li, Gang Wu

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations and longer…

Computation and Language · Computer Science 2020-02-11 Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

An Unsupervised Sentence Embedding Method by Mutual Information Maximization

BERT is inefficient for sentence-pair tasks such as clustering or semantic search as it needs to evaluate combinatorially many sentence pairs which is very time-consuming. Sentence BERT (SBERT) attempted to solve this challenge by learning…

Computation and Language · Computer Science 2021-02-08 Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, Lidong Bing

Can BERT Refrain from Forgetting on Sequential Tasks? A Probing Study

Large pre-trained language models help to achieve state of the art on a variety of natural language processing (NLP) tasks, nevertheless, they still suffer from forgetting when incrementally learning a sequence of tasks. To alleviate this…

Computation and Language · Computer Science 2023-03-03 Mingxu Tao, Yansong Feng, Dongyan Zhao

SenseBERT: Driving Some Sense into BERT

The ability to learn from large unlabeled corpora has allowed neural language models to advance the frontier in natural language understanding. However, existing self-supervision techniques operate at the word form level, which serves as a…

Computation and Language · Computer Science 2020-05-19 Yoav Levine, Barak Lenz, Or Dagan, Ori Ram, Dan Padnos, Or Sharir, Shai Shalev-Shwartz, Amnon Shashua, Yoav Shoham

DABERT: Dual Attention Enhanced BERT for Semantic Matching

Transformer-based pre-trained language models such as BERT have achieved remarkable results in Semantic Sentence Matching. However, existing models still suffer from insufficient ability to capture subtle differences. Minor noise like word…

Computation and Language · Computer Science 2023-04-17 Sirui Wang, Di Liang, Jian Song, Yuntao Li, Wei Wu

Students Need More Attention: BERT-based Attention Model for Small Data with Application to Automatic Patient Message Triage

Small and imbalanced datasets commonly seen in healthcare represent a challenge when training classifiers based on deep learning models. So motivated, we propose a novel framework based on BioBERT (Bidirectional Encoder Representations from…

Computation and Language · Computer Science 2020-06-23 Shijing Si, Rui Wang, Jedrek Wosik, Hao Zhang, David Dov, Guoyin Wang, Ricardo Henao, Lawrence Carin

Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks

Contextualized representations from a pre-trained language model are central to achieve a high performance on downstream NLP task. The pre-trained BERT and A Lite BERT (ALBERT) models can be fine-tuned to give state-of-the-art results in…

Computation and Language · Computer Science 2021-01-27 Hyunjin Choi, Judong Kim, Seongho Joe, Youngjune Gwon

Adversarial Self-Attention for Language Understanding

Deep neural models (e.g. Transformer) naturally learn spurious features, which create a "shortcut" between the labels and inputs, thus impairing the generalization and robustness. This paper advances the self-attention mechanism to its…

Computation and Language · Computer Science 2023-02-09 Hongqiu Wu, Ruixue Ding, Hai Zhao, Pengjun Xie, Fei Huang, Min Zhang

Sensi-BERT: Towards Sensitivity Driven Fine-Tuning for Parameter-Efficient BERT

Large pre-trained language models have recently gained significant traction due to their improved performance on various down-stream tasks like text classification and question answering, requiring only few epochs of fine-tuning. However,…

Computation and Language · Computer Science 2023-09-01 Souvik Kundu, Sharath Nittur Sridhar, Maciej Szankin, Sairam Sundaresan

SesameBERT: Attention for Anywhere

Fine-tuning with pre-trained models has achieved exceptional results for many language tasks. In this study, we focused on one such self-attention network model, namely BERT, which has performed well in terms of stacking layers across…

Computation and Language · Computer Science 2019-10-09 Ta-Chun Su, Hsiang-Chih Cheng

Layer-wise Guided Training for BERT: Learning Incrementally Refined Document Representations

Although BERT is widely used by the NLP community, little is known about its inner workings. Several attempts have been made to shed light on certain aspects of BERT, often with contradicting conclusions. A much raised concern focuses on…

Computation and Language · Computer Science 2020-10-13 Nikolaos Manginas, Ilias Chalkidis, Prodromos Malakasiotis

Iterative Auto-Annotation for Scientific Named Entity Recognition Using BERT-Based Models

This paper presents an iterative approach to performing Scientific Named Entity Recognition (SciNER) using BERT-based models. We leverage transfer learning to fine-tune pretrained models with a small but high-quality set of manually…

Computation and Language · Computer Science 2025-02-25 Kartik Gupta

Position-Aware Self-Attention based Neural Sequence Labeling

Sequence labeling is a fundamental task in natural language processing and has been widely studied. Recently, RNN-based sequence labeling models have increasingly gained attentions. Despite superior performance achieved by learning the long…

Computation and Language · Computer Science 2021-10-19 Wei Wei, Zanbo Wang, Xianling Mao, Guangyou Zhou, Pan Zhou, Sheng Jiang

Revealing the Dark Secrets of BERT

BERT-based architectures currently give state-of-the-art performance on many NLP tasks, but little is known about the exact mechanisms that contribute to its success. In the current work, we focus on the interpretation of self-attention,…

Computation and Language · Computer Science 2019-09-12 Olga Kovaleva, Alexey Romanov, Anna Rogers, Anna Rumshisky

SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space

Sparse attention reduces the quadratic complexity of full self-attention but faces two challenges: (1) an attention gap, where applying sparse attention to full-attention-trained models causes performance degradation due to train-inference…

Computation and Language · Computer Science 2026-02-02 Zhenyi Shen, Junru Lu, Lin Gui, Jiazheng Li, Yulan He, Di Yin, Xing Sun

What Does BERT Look At? An Analysis of BERT's Attention

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused…

Computation and Language · Computer Science 2019-06-12 Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning