Related papers: Transformers versus the EM Algorithm in Multi-clas…

Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures

The transformer architecture has demonstrated remarkable capabilities in modern artificial intelligence, among which the capability of implicitly learning an internal model during inference time is widely believed to play a key role in the…

Machine Learning · Computer Science 2026-02-10 Zhiheng Chen, Ruofan Wu, Guanhua Fang

Big Learning Expectation Maximization

Mixture models serve as one fundamental tool with versatile applications. However, their training techniques, like the popular Expectation Maximization (EM) algorithm, are notoriously sensitive to parameter initialization and often suffer…

Machine Learning · Computer Science 2023-12-20 Yulai Cong, Sijia Li

Pretraining Data Mixtures Enable Narrow Model Selection Capabilities in Transformer Models

Transformer models, notably large language models (LLMs), have the remarkable ability to perform in-context learning (ICL) -- to perform new tasks when prompted with unseen input-output examples without any explicit model training. In this…

Machine Learning · Computer Science 2023-11-03 Steve Yadlowsky, Lyric Doshi, Nilesh Tripuraneni

Robust Unsupervised Multi-task and Transfer Learning on Gaussian Mixture Models

Unsupervised learning has been widely used in many real-world applications. One of the simplest and most important unsupervised learning models is the Gaussian mixture model (GMM). In this work, we study the multi-task learning problem on…

Machine Learning · Statistics 2025-12-29 Ye Tian, Haolei Weng, Lucy Xia, Yang Feng

The Closeness of In-Context Learning and Weight Shifting for Softmax Regression

Large language models (LLMs) are known for their exceptional performance in natural language processing, making them highly effective in many human life-related or even job-related tasks. The attention mechanism in the Transformer…

Computation and Language · Computer Science 2023-04-27 Shuai Li, Zhao Song, Yu Xia, Tong Yu, Tianyi Zhou

Learning Theory of Transformers: Local-to-Global Approximation via Softmax Partition of Unity

This paper investigates the learning theory of Transformer networks for regression tasks on the compact Euclidean domain $[0,1]^d$ and $d$-dimensional compact Riemannian manifolds. We propose a novel constructive approximation framework for…

Machine Learning · Statistics 2026-05-12 Zhongjie Shi, Wenjing Liao

Learning Spectral Methods by Transformers

Transformers demonstrate significant advantages as the building block of modern LLMs. In this work, we study the capacities of Transformers in performing unsupervised learning. We show that multi-layered Transformers, given a sufficiently…

Machine Learning · Statistics 2025-01-14 Yihan He, Yuan Cao, Hong-Yu Chen, Dennis Wu, Jianqing Fan, Han Liu

Provable optimal transport with transformers: The essence of depth and prompt engineering

Despite their empirical success, the internal mechanism by which transformer models align tokens during language processing remains poorly understood. This paper provides a mechanistic and theoretical explanation of token alignment in LLMs.…

Machine Learning · Computer Science 2025-12-19 Hadi Daneshmand

Transformers Meet In-Context Learning: A Universal Approximation Theory

Large language models are capable of in-context learning, the ability to perform new tasks at test time using a handful of input-output examples, without parameter updates. We develop a universal approximation theory to elucidate how…

Machine Learning · Computer Science 2025-08-29 Gen Li, Yuchen Jiao, Yu Huang, Yuting Wei, Yuxin Chen

Analysis of a Generalized Expectation-Maximization Algorithm for Gaussian Mixture Models: A Control Systems Perspective

The Expectation-Maximization (EM) algorithm is one of the most popular methods used to solve the problem of parametric distribution-based clustering in unsupervised learning. In this paper, we propose to analyze a generalized EM (GEM)…

Optimization and Control · Mathematics 2021-05-19 Sarthak Chatterjee, Orlando Romero, Sérgio Pequito

Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions

In order to understand the in-context learning phenomenon, recent works have adopted a stylized experimental framework and demonstrated that Transformers can learn gradient-based learning algorithms for various classes of real-valued…

Machine Learning · Computer Science 2023-10-05 Satwik Bhattamishra, Arkil Patel, Phil Blunsom, Varun Kanade

Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent

Transformers have demonstrated remarkable in-context learning (ICL) capabilities. The strong ICL performance of transformers is commonly believed to arise from their ability to implicitly execute certain algorithms on the context, thereby…

Machine Learning · Computer Science 2026-05-08 Chenyang Zhang, Yuan Cao

Transformers are Minimax Optimal Nonparametric In-Context Learners

In-context learning (ICL) of large language models has proven to be a surprisingly effective method of learning a new task from only a few demonstrative examples. In this paper, we study the efficacy of ICL from the viewpoint of statistical…

Machine Learning · Statistics 2024-10-03 Juno Kim, Tai Nakamaki, Taiji Suzuki

Transformers as Measure-Theoretic Associative Memory: A Statistical Perspective and Minimax Optimality

Transformers excel through content-addressable retrieval and the ability to exploit contexts of, in principle, unbounded length. We recast associative memory at the level of probability measures, treating a context as a distribution over…

Machine Learning · Statistics 2026-02-03 Ryotaro Kawata, Taiji Suzuki

Quantum Expectation-Maximization for Gaussian Mixture Models

The Expectation-Maximization (EM) algorithm is a fundamental tool in unsupervised machine learning. It is often used as an efficient way to solve Maximum Likelihood (ML) estimation problems, especially for models with latent variables. It…

Quantum Physics · Physics 2020-07-08 Iordanis Kerenidis, Alessandro Luongo, Anupam Prakash

On the Training Convergence of Transformers for In-Context Classification of Gaussian Mixtures

Although transformers have demonstrated impressive capabilities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism that allows transformers to perform ICL is still in its infancy. This work aims…

Machine Learning · Computer Science 2025-05-30 Wei Shen, Ruida Zhou, Jing Yang, Cong Shen

Transformers can optimally learn regression mixture models

Mixture models arise in many regression problems, but most methods have seen limited adoption partly due to these algorithms' highly-tailored and model-specific nature. On the other hand, transformers are flexible, neural sequence models…

Machine Learning · Computer Science 2023-11-15 Reese Pathak, Rajat Sen, Weihao Kong, Abhimanyu Das

In-Context Convergence of Transformers

Transformers have recently revolutionized many domains in modern machine learning and one salient discovery is their remarkable in-context learning capability, where models can solve an unseen task by utilizing task-specific prompts without…

Machine Learning · Computer Science 2023-10-10 Yu Huang, Yuan Cheng, Yingbin Liang

GAN-EM: GAN based EM learning framework

The expectation maximization (EM) algorithm finds maximum likelihood solutions for models with latent variables. A typical example is the Gaussian Mixture Model (GMM), which requires a Gaussian assumption; however, natural images are highly…

Machine Learning · Computer Science 2018-12-04 Wentian Zhao, Shaojie Wang, Zhihuai Xie, Jing Shi, Chenliang Xu

Provably Transformers Harness Multi-Concept Word Semantics for Efficient In-Context Learning

Transformer-based large language models (LLMs) have displayed remarkable creative prowess and emergence capabilities. Existing empirical studies have revealed a strong connection between these LLMs' impressive emergence abilities and their…

Machine Learning · Computer Science 2025-08-14 Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Taiji Suzuki, Qingfu Zhang, Hau-San Wong