Related papers: KDEformer: Accelerating Transformers via Kernel De…

Designing Robust Transformers using Robust Kernel Density Estimation

Recent advances in Transformer architectures have empowered their empirical success in a variety of tasks across different domains. However, existing works mainly focus on predictive accuracy and computational cost, without considering…

Machine Learning · Computer Science 2023-11-09 Xing Han, Tongzheng Ren, Tan Minh Nguyen, Khai Nguyen, Joydeep Ghosh, Nhat Ho

DiScoFormer: Plug-In Density and Score Estimation with Transformers

Estimating probability density and its score from samples remains a core problem in generative modeling, Bayesian inference, and kinetic theory. Existing methods are bifurcated: classical kernel density estimators (KDE) generalize across…

Machine Learning · Computer Science 2026-05-29 Vasily Ilin, Peter Sushko, Ranjay Krishna

DCT-Former: Efficient Self-Attention with Discrete Cosine Transform

Since their introduction the Transformer architectures emerged as the dominating architectures for both natural language processing and, more recently, computer vision applications. An intrinsic limitation of this family of "fully-attentive"…

Machine Learning · Computer Science 2023-03-16 Carmelo Scribano, Giorgia Franchini, Marco Prato, Marko Bertogna

Even Faster Kernel Matrix Linear Algebra via Density Estimation

This paper studies the use of kernel density estimation (KDE) for linear algebraic tasks involving the kernel matrix of a collection of $n$ data points in $\mathbb R^d$. In particular, we improve upon existing algorithms for computing the…

Data Structures and Algorithms · Computer Science 2026-03-05 Rikhav Shah, Sandeep Silwal, Haike Xu

Fast Kernel Density Estimation with Density Matrices and Random Fourier Features

Kernel density estimation (KDE) is one of the most widely used nonparametric density estimation methods. The fact that it is a memory-based method, i.e., it uses the entire training data set for prediction, makes it unsuitable for most…

Machine Learning · Computer Science 2022-08-08 Joseph A. Gallego, Juan F. Osorio, Fabio A. González

KVT: k-NN Attention for Boosting Vision Transformers

Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed and they show promising…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Pichao Wang, Xue Wang, Fan Wang, Ming Lin, Shuning Chang, Hao Li, Rong Jin

Data-Aware Random Feature Kernel for Transformers

Transformers excel across domains, yet their quadratic attention complexity poses a barrier to scaling. Random-feature attention, as in Performers, can reduce this cost to linear in the sequence length by approximating the softmax kernel…

Machine Learning · Computer Science 2026-03-05 Amirhossein Farzam, Hossein Mobahi, Nolan Andrew Miller, Luke Sernau

Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language…

Computation and Language · Computer Science 2024-06-21 Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

A Simple and Scalable Kernel Density Approach for Reliable Uncertainty Quantification in Atomistic Machine Learning

Machine learning models are increasingly used to predict material properties and accelerate atomistic simulations, but the reliability of their predictions depends on the representativeness of the training data. We present a scalable,…

Chemical Physics · Physics 2025-10-20 Daniel Willimetz, Lukáš Grajciar

ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

Attention mechanisms have been crucial for image diffusion models; however, their quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the…

Computer Vision and Pattern Recognition · Computer Science 2024-05-09 Ethan Smith, Nayan Saxena, Aninda Saha

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching…

Hardware Architecture · Computer Science 2025-01-15 Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan

Compute-Efficient Medical Image Classification with Softmax-Free Transformers and Sequence Normalization

The Transformer model has been pivotal in advancing fields such as natural language processing, speech recognition, and computer vision. However, a critical limitation of this model is its quadratic computational and memory complexity…

Computer Vision and Pattern Recognition · Computer Science 2024-06-04 Firas Khader, Omar S. M. El Nahhas, Tianyu Han, Gustav Müller-Franzes, Sven Nebelung, Jakob Nikolas Kather, Daniel Truhn

Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding

The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original…

Machine Learning · Computer Science 2021-11-04 Shengjie Luo, Shanda Li, Tianle Cai, Di He, Dinglan Peng, Shuxin Zheng, Guolin Ke, Liwei Wang, Tie-Yan Liu

End-to-End Transformer Acceleration Through Processing-in-Memory Architectures

Transformers have become central to natural language processing and large language models, but their deployment at scale faces three major challenges. First, the attention mechanism requires massive matrix multiplications and frequent…

Hardware Architecture · Computer Science 2026-01-22 Xiaoxuan Yang, Peilin Chen, Tergel Molom-Ochir, Yiran Chen

Fair Comparison between Efficient Attentions

Transformers have been successfully used in various fields and are becoming the standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of…

Computer Vision and Pattern Recognition · Computer Science 2022-06-02 Jiuk Hong, Chaehyeon Lee, Soyoun Bang, Heechul Jung

Deep Tensor Network

The quadratic complexity of dot-product attention introduced in Transformer remains a fundamental bottleneck impeding the progress of foundation models toward unbounded context lengths. Addressing this challenge, we introduce the Deep…

Machine Learning · Computer Science 2025-09-03 Yifan Zhang

Efficient Linear Attention for Fast and Accurate Keypoint Matching

Recently Transformers have provided state-of-the-art performance in sparse matching, crucial to realize high-performance 3D vision applications. Yet, these Transformers lack efficiency due to the quadratic computational complexity of their…

Computer Vision and Pattern Recognition · Computer Science 2022-04-25 Suwichaya Suwanwimolkul, Satoshi Komorita

Attention Mechanisms Through the Lens of Numerical Methods: Approximation Methods and Alternative Formulations

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work…

Numerical Analysis · Mathematics 2026-04-03 Michel Fabrice Serret, Alice Cortinovis, Yijun Dong, Diana Halikias, Anna Ma, Fabio Matti, Deanna Needell, Katherine J. Pearce, Elizaveta Rebrova, Disha Shur, Rudi Smith, Hai-Xiao Wang, Laura Grigori

NoiseFormer -- Noise Diffused Symmetric Attention Transformer

The Transformer architecture has been a very successful long runner in the field of Deep Learning (DL) and Large Language Models (LLM) because of its powerful attention-based learning and parallel-natured architecture. As the models grow gigantic…

Machine Learning · Computer Science 2026-01-21 Phani Kumar, Nyshadham, Jyothendra Varma, Polisetty V R K, Aditya Rathore

QuadTree Attention for Vision Transformers

Transformers have been successful in many vision tasks, thanks to their capability of capturing long-range dependency. However, their quadratic computational complexity poses a major obstacle for applying them to vision tasks requiring…

Computer Vision and Pattern Recognition · Computer Science 2022-03-25 Shitao Tang, Jiahui Zhang, Siyu Zhu, Ping Tan