Related papers: CSA-Trans: Code Structure Aware Transformer for AS…

Self-Attention in Colors: Another Take on Encoding Graph Structure in Transformers

We introduce a novel self-attention mechanism, which we call CSA (Chromatic Self-Attention), which extends the notion of attention scores to attention _filters_, independently modulating the feature channels. We showcase CSA in a…

Machine Learning · Computer Science 2023-04-24 Romain Menegaux, Emmanuel Jehanno, Margot Selosse, Julien Mairal

Understanding Long Programming Languages with Structure-Aware Sparse Attention

Programming-based Pre-trained Language Models (PPLMs) such as CodeBERT have achieved great success in many downstream code-related tasks. Since the memory and computational complexity of self-attention in the Transformer grow quadratically…

Computation and Language · Computer Science 2022-05-30 Tingting Liu, Chengyu Wang, Cen Chen, Ming Gao, Aoying Zhou

Structure-Aware Transformer for Graph Representation Learning

The Transformer architecture has gained growing attention in graph representation learning recently, as it naturally overcomes several limitations of graph neural networks (GNNs) by avoiding their strict structural inductive biases and…

Machine Learning · Statistics 2022-06-14 Dexiong Chen, Leslie O'Bray, Karsten Borgwardt

AST-MHSA : Code Summarization using Multi-Head Self-Attention

Code summarization aims to generate concise natural language descriptions for source code. The prevailing approaches adopt transformer-based encoder-decoder architectures, where the Abstract Syntax Tree (AST) of the source code is utilized…

Computation and Language · Computer Science 2023-08-11 Yeshwanth Nagaraj, Ujjwal Gupta

Towards Online End-to-end Transformer Automatic Speech Recognition

The Transformer self-attention network has recently shown promising performance as an alternative to recurrent neural networks in end-to-end (E2E) automatic speech recognition (ASR) systems. However, Transformer has a drawback in that the…

Audio and Speech Processing · Electrical Eng. & Systems 2019-10-29 Emiru Tsunoo, Yosuke Kashiwagi, Toshiyuki Kumakura, Shinji Watanabe

What does Transformer learn about source code?

In the field of source code processing, the transformer-based representation models have shown great powerfulness and have achieved state-of-the-art (SOTA) performance in many tasks. Although the transformer models process the sequential…

Software Engineering · Computer Science 2022-07-19 Kechi Zhang, Ge Li, Zhi Jin

Sparse Attention-Based Neural Networks for Code Classification

Categorizing source codes accurately and efficiently is a challenging problem in real-world programming education platform management. In recent years, model-based approaches utilizing abstract syntax trees (ASTs) have been widely applied…

Programming Languages · Computer Science 2023-11-14 Ziyang Xiang, Zaixi Zhang, Qi Liu

SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

While the self-attention mechanism has been widely used in a wide variety of tasks, it has the unfortunate property of a quadratic cost with respect to the input length, which makes it difficult to deal with long inputs. In this paper, we…

Computation and Language · Computer Science 2020-09-30 Xiaoya Li, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu, Jiwei Li

In-Context Compositional Learning via Sparse Coding Transformer

Transformer architectures have achieved remarkable success across language, vision, and multimodal tasks, and there is growing demand for them to address in-context compositional learning tasks. In these tasks, models solve the target…

Machine Learning · Computer Science 2025-11-26 Wei Chen, Jingxi Yu, Zichen Miao, Qiang Qiu

Exclusive Self Attention

We introduce exclusive self attention (XSA), a simple modification of self attention (SA) that improves Transformer's sequence modeling performance. The key idea is to constrain attention to capture only information orthogonal to the…

Machine Learning · Computer Science 2026-03-11 Shuangfei Zhai

Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction

The attention operator is arguably the key distinguishing factor of transformer architectures, which have demonstrated state-of-the-art performance on a variety of tasks. However, transformer attention operators often impose a significant…

Machine Learning · Computer Science 2024-12-24 Ziyang Wu, Tianjiao Ding, Yifu Lu, Druv Pai, Jingyuan Zhang, Weida Wang, Yaodong Yu, Yi Ma, Benjamin D. Haeffele

SAMSA: Efficient Transformer for Many Data Modalities

The versatility of self-attention mechanism earned transformers great success in almost all data modalities, with limitations on the quadratic complexity and difficulty of training. Efficient transformers, on the other hand, often rely on…

Machine Learning · Computer Science 2024-08-20 Minh Lenhat, Viet Anh Nguyen, Khoa Nguyen, Duong Duc Hieu, Dao Huu Hung, Truong Son Hy

Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by the global self-attention, various methods constrain the range of attention within a local region to…

Computer Vision and Pattern Recognition · Computer Science 2021-12-30 Sitong Wu, Tianyi Wu, Haoru Tan, Guodong Guo

Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost

To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention under hand-crafted patterns and 2) full attention followed by a…

Machine Learning · Computer Science 2022-10-28 Sungjun Cho, Seonwoo Min, Jinwoo Kim, Moontae Lee, Honglak Lee, Seunghoon Hong

Towards Better Multi-head Attention via Channel-wise Sample Permutation

Transformer plays a central role in many fundamental deep learning models, e.g., the ViT in computer vision and the BERT and GPT in natural language processing, whose effectiveness is mainly attributed to its multi-head attention (MHA)…

Machine Learning · Computer Science 2024-10-16 Shen Yuan, Hongteng Xu

Rethinking Graph Transformers with Spectral Attention

In recent years, the Transformer architecture has proven to be very successful in sequence processing, but its application to other data structures, such as graphs, has remained limited due to the difficulty of properly defining positions.…

Machine Learning · Computer Science 2021-10-28 Devin Kreuzer, Dominique Beaini, William L. Hamilton, Vincent Létourneau, Prudencio Tossou

SignGT: Signed Attention-based Graph Transformer for Graph Representation Learning

The emerging graph Transformers have achieved impressive performance for graph representation learning over graph neural networks (GNNs). In this work, we regard the self-attention mechanism, the core module of graph Transformers, as a…

Machine Learning · Computer Science 2023-10-18 Jinsong Chen, Gaichao Li, John E. Hopcroft, Kun He

Contextual Transformer Networks for Visual Recognition

Transformer with self-attention has led to the revolutionizing of natural language processing field, and recently inspires the emergence of Transformer-style architecture design with competitive results in numerous computer vision tasks.…

Computer Vision and Pattern Recognition · Computer Science 2021-07-27 Yehao Li, Ting Yao, Yingwei Pan, Tao Mei

Attention Schema-based Attention Control (ASAC): A Cognitive-Inspired Approach for Attention Management in Transformers

Attention mechanisms have become integral in AI, significantly enhancing model performance and scalability by drawing inspiration from human cognition. Concurrently, the Attention Schema Theory (AST) in cognitive science posits that…

Artificial Intelligence · Computer Science 2025-09-22 Krati Saxena, Federico Jurado Ruiz, Guido Manzi, Dianbo Liu, Alex Lamb

SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation

We present SegNeXt, a simple convolutional network architecture for semantic segmentation. Recent transformer-based models have dominated the field of semantic segmentation due to the efficiency of self-attention in encoding spatial…

Computer Vision and Pattern Recognition · Computer Science 2022-09-20 Meng-Hao Guo, Cheng-Ze Lu, Qibin Hou, Zhengning Liu, Ming-Ming Cheng, Shi-Min Hu