Related papers: On Separate Normalization in Self-supervised Trans…

Inceptive Transformers: Enhancing Contextual Representations through Multi-Scale Feature Learning Across Domains and Languages

Encoder transformer models compress information from all tokens in a sequence into a single [CLS] token to represent global context. This approach risks diluting fine-grained or hierarchical features, leading to information loss in…

Computation and Language · Computer Science 2025-09-23 Asif Shahriar, Rifat Shahriyar, M Saifur Rahman

Revisiting [CLS] and Patch Token Interaction in Vision Transformers

Vision Transformers have emerged as powerful, scalable and versatile representation learners. To capture both global and local features, a learnable [CLS] class token is typically prepended to the input sequence of patch tokens. Despite…

Computer Vision and Pattern Recognition · Computer Science 2026-02-10 Alexis Marouani, Oriane Siméoni, Hervé Jégou, Piotr Bojanowski, Huy V. Vo

Normalizing the Normalizers: Comparing and Extending Network Normalization Schemes

Normalization techniques have only recently begun to be exploited in supervised learning tasks. Batch normalization exploits mini-batch statistics to normalize the activations. This was shown to speed up training and result in better…

Machine Learning · Computer Science 2017-03-08 Mengye Ren, Renjie Liao, Raquel Urtasun, Fabian H. Sinz, Richard S. Zemel

Improving BERT Fine-tuning with Embedding Normalization

Large pre-trained sentence encoders like BERT start a new chapter in natural language processing. A common practice to apply pre-trained BERT to sequence classification tasks (e.g., classification of sentences or sentence pairs) is by…

Computation and Language · Computer Science 2020-02-26 Wenxuan Zhou, Junyi Du, Xiang Ren

Master Thesis: Neural Sign Language Translation by Learning Tokenization

In this thesis, we propose a multitask learning based method to improve Neural Sign Language Translation (NSLT) consisting of two parts, a tokenization layer and Neural Machine Translation (NMT). The tokenization part focuses on how Sign…

Computation and Language · Computer Science 2020-11-19 Alptekin Orbay

Sequence Repetition Enhances Token Embeddings and Improves Sequence Labeling with Decoder-only Language Models

Modern language models (LMs) are trained in an autoregressive manner, conditioned only on the prefix. In contrast, sequence labeling (SL) tasks assign labels to each individual input token, naturally benefiting from bidirectional context.…

Computation and Language · Computer Science 2026-01-27 Matija Luka Kukić, Marko Čuljak, David Dukić, Martin Tutek, Jan Šnajder

Multi-Label Self-Supervised Learning with Scene Images

Self-supervised learning (SSL) methods targeting scene images have seen a rapid growth recently, and they mostly rely on either a dedicated dense matching mechanism or a costly unsupervised object discovery module. This paper shows that…

Computer Vision and Pattern Recognition · Computer Science 2023-10-02 Ke Zhu, Minghao Fu, Jianxin Wu

Pseudo Labelling for Enhanced Masked Autoencoders

Masked Image Modeling (MIM)-based models, such as SdAE, CAE, GreenMIM, and MixAE, have explored different strategies to enhance the performance of Masked Autoencoders (MAE) by modifying prediction, loss functions, or incorporating…

Computer Vision and Pattern Recognition · Computer Science 2024-06-26 Srinivasa Rao Nandam, Sara Atito, Zhenhua Feng, Josef Kittler, Muhammad Awais

Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers

Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and…

Machine Learning · Computer Science 2023-10-27 Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, David Z. Pan

Token Masking Improves Transformer-Based Text Classification

While transformer-based models achieve strong performance on text classification, we explore whether masking input tokens can further enhance their effectiveness. We propose token masking regularization, a simple yet theoretically motivated…

Computation and Language · Computer Science 2025-05-20 Xianglong Xu, John Bowen, Rojin Taheri

Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks

Transformer networks have seen great success in natural language processing and machine vision, where task objectives such as next word prediction and image classification benefit from nuanced context sensitivity across high-dimensional…

Machine Learning · Computer Science 2022-12-13 Yuxuan Li, James L. McClelland

Enhanced Graph Transformer with Serialized Graph Tokens

Transformers have demonstrated success in graph learning, particularly for node-level tasks. However, existing methods encounter an information bottleneck when generating graph-level representations. The prevalent single token paradigm…

Machine Learning · Computer Science 2026-02-11 Ruixiang Wang, Yuyang Hong, Shiming Xiang, Chunhong Pan

Joint Optimization of Tokenization and Downstream Model

Since traditional tokenizers are isolated from a downstream task and model, they cannot output an appropriate tokenization depending on the task and model, although recent studies imply that the appropriate tokenization improves the…

Computation and Language · Computer Science 2021-05-27 Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

Sign Language Transformers: Joint End-to-end Sign Language Recognition and Translation

Prior work on Sign Language Translation has shown that having a mid-level sign gloss representation (effectively recognizing the individual signs) improves the translation performance drastically. In fact, the current state-of-the-art in…

Computer Vision and Pattern Recognition · Computer Science 2020-04-02 Necati Cihan Camgoz, Oscar Koller, Simon Hadfield, Richard Bowden

Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency

Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance. However, both the training and inference process of these models may…

Computation and Language · Computer Science 2021-05-04 Jinchuan Tian, Rongzhi Gu, Helin Wang, Yuexian Zou

Learning Language-Specific Layers for Multilingual Machine Translation

Multilingual Machine Translation promises to improve translation quality between non-English languages. This is advantageous for several reasons, namely lower latency (no need to translate twice), and reduced error cascades (e.g., avoiding…

Computation and Language · Computer Science 2023-05-05 Telmo Pessoa Pires, Robin M. Schmidt, Yi-Hsiu Liao, Stephan Peitz

LMK > CLS: Landmark Pooling for Dense Embeddings

Representation learning is central to many downstream tasks such as search, clustering, classification, and reranking. State-of-the-art sequence encoders typically collapse a variable-length token sequence to a single vector using a pooling…

Computation and Language · Computer Science 2026-01-30 Meet Doshi, Aashka Trivedi, Vishwajeet Kumar, Parul Awasthy, Yulong Li, Jaydeep Sen, Radu Florian, Sachindra Joshi

Encoding Multi-Domain Scientific Papers by Ensembling Multiple CLS Tokens

Many useful tasks on scientific documents, such as topic classification and citation prediction, involve corpora that span multiple scientific domains. Typically, such tasks are accomplished by representing the text with a vector embedding…

Computation and Language · Computer Science 2023-09-11 Ronald Seoh, Haw-Shiuan Chang, Andrew McCallum

MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation

This paper proposes a novel transformer-based framework that aims to enhance weakly supervised semantic segmentation (WSSS) by generating accurate class-specific object localization maps as pseudo labels. Building upon the observation that…

Computer Vision and Pattern Recognition · Computer Science 2023-08-08 Lian Xu, Mohammed Bennamoun, Farid Boussaid, Hamid Laga, Wanli Ouyang, Dan Xu

An Empirical Study Of Self-supervised Learning Approaches For Object Detection With Transformers

Self-supervised learning (SSL) methods such as masked language modeling have shown massive performance gains by pretraining transformer models for a variety of natural language processing tasks. The follow-up research adapted similar…

Computer Vision and Pattern Recognition · Computer Science 2022-05-12 Gokul Karthik Kumar, Sahal Shaji Mullappilly, Abhishek Singh Gehlot