Related papers: Transformers Provably Learn Sparse Token Selection…

Transformer Learns Optimal Variable Selection in Group-Sparse Classification

Transformers have demonstrated remarkable success across various applications. However, the success of transformers have not been understood in theory. In this work, we give a case study of how transformers can be trained to learn a classic…

Machine Learning · Statistics 2025-04-14 Chenyang Zhang , Xuran Meng , Yuan Cao

When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective

Theoretical efforts to prove advantages of Transformers in comparison with classical architectures such as feedforward and recurrent neural networks have mostly focused on representational power. In this work, we take an alternative…

Machine Learning · Statistics 2025-03-17 Alireza Mousavi-Hosseini , Clayton Sanford , Denny Wu , Murat A. Erdogdu

Transformers Provably Learn Sparse XOR with Polylogarithmic Parameters

Learning sparse parity functions has become a theoretical testbed for studying feature learning in neural networks. However, existing analyses primarily focus on Feed-Forward Neural Networks (FFNNs). Meanwhile, theoretical understanding of…

Machine Learning · Computer Science 2026-05-12 Yaomengxi Han , Debarghya Ghoshdastidar

Transformers Trained via Gradient Descent Can Provably Learn a Class of Teacher Models

Transformers have achieved great success across a wide range of applications, yet the theoretical foundations underlying their success remain largely unexplored. To demystify the strong capacities of transformers applied to versatile…

Machine Learning · Computer Science 2026-03-25 Chenyang Zhang , Qingyue Zhao , Quanquan Gu , Yuan Cao

Learning on Transformers is Provable Low-Rank and Sparse: A One-layer Analysis

Efficient training and inference algorithms, such as low-rank adaption and model pruning, have shown impressive performance for learning Transformer-based large foundation models. However, due to the technical challenges of the non-convex…

Machine Learning · Computer Science 2024-06-26 Hongkang Li , Meng Wang , Shuai Zhang , Sijia Liu , Pin-Yu Chen

Explore the Knowledge contained in Network Weights to Obtain Sparse Neural Networks

Sparse neural networks are important for achieving better generalization and enhancing computation efficiency. This paper proposes a novel learning approach to obtain sparse fully connected layers in neural networks (NNs) automatically. We…

Machine Learning · Computer Science 2021-04-28 Mengqiao Han , Xiabi Liu , Zhaoyang Hai , Zhengwen Li

A Theory for Compressibility of Graph Transformers for Transductive Learning

Transductive tasks on graphs differ fundamentally from typical supervised machine learning tasks, as the independent and identically distributed (i.i.d.) assumption does not hold among samples. Instead, all train/test/validation samples are…

Machine Learning · Computer Science 2024-11-21 Hamed Shirzad , Honghao Lin , Ameya Velingker , Balaji Venkatachalam , David Woodruff , Danica Sutherland

Finding trainable sparse networks through Neural Tangent Transfer

Deep neural networks have dramatically transformed machine learning, but their memory and energy demands are substantial. The requirements of real biological neural networks are rather modest in comparison, and one feature that might…

Machine Learning · Computer Science 2020-07-27 Tianlin Liu , Friedemann Zenke

White-Box Transformers via Sparse Rate Reduction

In this paper, we contend that the objective of representation learning is to compress and transform the distribution of the data, say sets of tokens, towards a mixture of low-dimensional Gaussian distributions supported on incoherent…

Machine Learning · Computer Science 2023-06-05 Yaodong Yu , Sam Buchanan , Druv Pai , Tianzhe Chu , Ziyang Wu , Shengbang Tong , Benjamin D. Haeffele , Yi Ma

Sparse Array Selection Across Arbitrary Sensor Geometries with Deep Transfer Learning

Sparse sensor array selection arises in many engineering applications, where it is imperative to obtain maximum spatial resolution from a limited number of array elements. Recent research shows that computational complexity of array…

Signal Processing · Electrical Eng. & Systems 2020-06-03 Ahmet M. Elbir , Kumar Vijay Mishra

Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context

Many neural network architectures are known to be Turing Complete, and can thus, in principle implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple…

Machine Learning · Computer Science 2024-06-05 Xiang Cheng , Yuxin Chen , Suvrit Sra

Learning Filter Bank Sparsifying Transforms

Data is said to follow the transform (or analysis) sparsity model if it becomes sparse when acted on by a linear operator called a sparsifying transform. Several algorithms have been designed to learn such a transform directly from data,…

Machine Learning · Statistics 2019-01-30 Luke Pfister , Yoram Bresler

Transformers learn to implement preconditioned gradient descent for in-context learning

Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of…

Machine Learning · Computer Science 2023-11-13 Kwangjun Ahn , Xiang Cheng , Hadi Daneshmand , Suvrit Sra

Towards Understanding Transformers in Learning Random Walks

Transformers have proven highly effective across various applications, especially in handling sequential data such as natural languages and time series. However, transformer models often lack clear interpretability, and the success of…

Machine Learning · Computer Science 2025-12-01 Wei Shi , Yuan Cao

Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block

In recent developments in the field of Computer Vision, a rise is seen in the use of transformer-based architectures. They are surpassing the state-of-the-art set by CNN architectures in accuracy but on the other hand, they are…

Computer Vision and Pattern Recognition · Computer Science 2021-10-12 Durvesh Malpure , Onkar Litake , Rajesh Ingle

Random Sparse Lifts: Construction, Analysis and Convergence of finite sparse networks

We present a framework to define a large class of neural networks for which, by construction, training by gradient flow provably reaches arbitrarily low loss when the number of parameters grows. Distinct from the fixed-space global…

Optimization and Control · Mathematics 2025-01-13 David A. R. Robin , Kevin Scaman , Marc Lelarge

Transformers as Multi-task Learners: Decoupling Features in Hidden Markov Models

Transformer based models have shown remarkable capabilities in sequence learning across a wide range of tasks, often performing well on specific task by leveraging input-output examples. Despite their empirical success, a comprehensive…

Machine Learning · Computer Science 2025-06-03 Yifan Hao , Chenlu Ye , Chi Han , Tong Zhang

What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks

We study the capabilities of the transformer architecture with varying depth. Specifically, we designed a novel set of sequence learning tasks to systematically evaluate and comprehend how the depth of transformer affects its ability to…

Machine Learning · Computer Science 2024-04-03 Xingwu Chen , Difan Zou

Are Sparse Neural Networks Better Hard Sample Learners?

While deep learning has demonstrated impressive progress, it remains a daunting challenge to learn from hard samples as these samples are usually noisy and intricate. These hard samples play a crucial role in the optimal performance of deep…

Computer Vision and Pattern Recognition · Computer Science 2024-12-30 Qiao Xiao , Boqian Wu , Lu Yin , Christopher Neil Gadzinski , Tianjin Huang , Mykola Pechenizkiy , Decebal Constantin Mocanu

Building Sparse Deep Feedforward Networks using Tree Receptive Fields

Sparse connectivity is an important factor behind the success of convolutional neural networks and recurrent neural networks. In this paper, we consider the problem of learning sparse connectivity for feedforward neural networks (FNNs). The…

Machine Learning · Computer Science 2018-07-18 Xiaopeng Li , Zhourong Chen , Nevin L. Zhang