Related papers: A Note on Automatic Data Transformation
Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data, it is customary to preprocess the variables to make them…
We present a technique for constructing suitable posterior probability distributions in situations for which the sampling distribution of the data is not known. This is very useful for modern scientific data analysis in the era of "big…
Few-shot image classification has recently witnessed the rise of representation learning being utilised for models to adapt to new classes using only a few training examples. Therefore, the properties of the representations, such as their…
We introduce a novel approach based on stochastic optimization to find the optimal sampling distribution for the data-driven stability analysis of switched linear systems. Our goal is to address limitations of existing approaches, in…
Recently, several studies proposed non-linear transformations, such as a logarithmic or Gaussianization transformation, as efficient tools to recapture information about the (Gaussian) initial conditions. During non-linear evolution, part…
Anomaly detection is a field of intense research. Identifying low probability events in data/images is a challenging problem given the high-dimensionality of the data, especially when no (or little) information about the anomaly is…
Many variables in the social, physical, and biosciences, including neuroscience, are non-normally distributed. To improve the statistical properties of such data, or to allow parametric testing, logarithmic or logit transformations are…
We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a…
Several distributions and families of distributions have been proposed to model skewed data, for example the skew-normal and related distributions. Lambert W random variables offer an alternative approach where, instead of constructing a new…
In 2023, the U.S. Food and Drug Administration issued guidance for adjustment of covariates in randomized clinical trials, emphasizing its role in enhancing precision and power through prognostic baseline variables. Despite its potential,…
Probabilistic programming is perfectly suited to reliable and transparent data science, as it allows the user to specify their models in a high-level language without worrying about the complexities of how to fit the models. Static analysis…
Logarithmic transformation of the data has been recommended in the literature in the case of highly skewed distributions such as those commonly found in information science. The purpose of the transformation is to make the data conform to…
Symbolic data analysis (SDA) aggregates large individual-level datasets into a small number of distributional summaries, such as random rectangles or random histograms. The inference is carried out using these summaries in place of the…
Finite mixtures of Gaussian distributions provide a flexible semi-parametric methodology for density estimation when the variables under investigation have no boundaries. However, in practical applications variables may be partially bounded…
Data compression has become one of the cornerstones of modern astronomical data analysis, with the vast majority of analyses compressing large raw datasets down to a manageable number of informative summaries. In this paper we provide a…
Randomized smoothing is a recent technique that achieves state-of-the-art performance in training certifiably robust deep neural networks. While the smoothing family of distributions is often connected to the choice of the norm used for…
The superposition of data sets with internal parametric self-similarity is a longstanding and widespread technique for the analysis of many types of experimental data across the physical sciences. Typically, this superposition is performed…
This article provides an original understanding of the behavior of a class of graph-oriented semi-supervised learning algorithms in the limit of large and numerous data. It is demonstrated that the intuition at the root of these methods…
Modern data workflows are inherently adaptive, repeatedly querying the same dataset to refine and validate sequential decisions, but such adaptivity can lead to overfitting and invalid statistical inference. Adaptive Data Analysis (ADA)…
Parameter estimation is one of the most important tasks in statistics, and is key to helping people understand the distribution behind a sample of observations. Traditionally, parameter estimation is done either by closed-form solutions…