Related papers: Probabilistic Transformers
Most expressivity results for transformers treat them as language recognizers -- devices that accept or reject strings -- rather than as they are used in practice: as language models that generate strings autoregressively and…
Mixture models arise in many regression problems, but most methods have seen limited adoption partly due to these algorithms' highly-tailored and model-specific nature. On the other hand, transformers are flexible, neural sequence models…
Time series forecasting is crucial for many fields, such as disaster warning, weather prediction, and energy consumption. The Transformer-based models are considered to have revolutionized the field of sequence modeling. However, the…
We propose and study properties of maximum likelihood estimators in the class of conditional transformation models. Based on a suitable explicit parameterisation of the unconditional or conditional transformation function, we establish a…
Maximum a posteriori and Bayes estimators are two common methods of point estimation in Bayesian Statistics. It is commonly accepted that maximum a posteriori estimators are a limiting case of Bayes estimators with 0-1 loss. In this paper,…
Gaussian mixture models are central to classical statistics, widely used in the information sciences, and have a rich mathematical structure. We examine their maximum likelihood estimates through the lens of algebraic statistics. The MLE is…
Transformers are deep architectures that define ``in-context maps'' which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In previous work, we…
Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters…
Expanding a lower-dimensional problem to a higher-dimensional space and then projecting back is often beneficial. This article rigorously investigates this perspective in the context of finite mixture models, namely how to improve inference…
Pattern recognition based on a high-dimensional predictor is considered. A classifier is defined which is based on a Transformer encoder. The rate of convergence of the misclassification probability of the classifier towards the optimal…
The main challenge in Bayesian models is to determine the posterior for the model parameters. Already, in models with only one or few parameters, the analytical posterior can only be determined in special settings. In Bayesian neural…
The main approach to inference for multivariate extremes consists in approximating the joint upper tail of the observations by a parametric family arising in the limit for extreme events. The latter may be expressed in terms of…
Seemingly unrelated linear regression models are introduced in which the distribution of the errors is a finite mixture of Gaussian components. Identifiability conditions are provided. The score vector and the Hessian matrix are derived.…
We establish the consistency of a nonparametric maximum likelihood estimator for a class of stochastic inverse problems. We proceed by embedding the framework into the general settings of early results of Pfanzagl related to mixtures.
The Bayesian approach to machine learning amounts to computing posterior distributions of random variables from a probabilistic model of how the variables are related (that is, a prior distribution) and a set of observations of variables.…
Parameter estimation is one of the most important tasks in statistics, and is key to helping people understand the distribution behind a sample of observations. Traditionally parameter estimation is done either by closed-form solutions…
Transformers have dominated empirical machine learning models of natural language processing. In this paper, we introduce basic concepts of Transformers and present key techniques that form the recent advances of these models. This includes…
The identification of nonlinear dynamics from observations is essential for the alignment of the theoretical ideas and experimental data. The last, in turn, is often corrupted by the side effects and noise of different natures, so…
The transformer is a neural network component that can be used to learn useful representations of sequences or sets of data-points. The transformer has driven recent advances in natural language processing, computer vision, and…
In order to rigorously define maximum-a-posteriori estimators for nonparametric Bayesian inverse problems for general Banach space valued parameters, we derive and prove certain previously postulated but unproven bounds on small ball…