Related papers: Generalized massive optimal data compression
We present a method for radical linear compression of datasets where the data are dependent on some number $M$ of parameters. We show that, if the noise in the data is independent of the parameters, we can form $M$ linear combinations of…
The goal in thinning is to summarize a dataset using a small set of representative points. Remarkably, sub-Gaussian thinning algorithms like Kernel Halving and Compress can match the quality of uniform subsampling while substantially…
Nonparametric regression for massive numbers of samples (n) and features (p) is an increasingly important problem. In big n settings, a common strategy is to partition the feature space, and then separately apply simple models to each…
For a collection of distributions over a countable support set, the worst case universal compression formulation by Shtarkov attempts to assign a universal distribution over the support set. The formulation aims to ensure that the universal…
A signature result in compressed sensing is that Gaussian random sampling achieves stable and robust recovery of sparse vectors under optimal conditions on the number of measurements. However, in the context of image reconstruction, it has…
We discuss the statistical properties of a recently introduced unbiased stochastic approximation to the score equations for maximum likelihood calculation for Gaussian processes. Under certain conditions, including bounded condition number…
Modern data analysis frequently involves variables with highly non-Gaussian marginal distributions. However, commonly used analysis methods are most effective with roughly Gaussian data. This paper introduces an automatic transformation…
Gaussian process regression is a powerful Bayesian nonlinear regression method. Recent research has enabled the capture of many types of observations using non-Gaussian likelihoods. To deal with various tasks in spatial modeling, we benefit…
The influx of massive amounts of data from current and upcoming cosmological surveys necessitates compression schemes that can efficiently summarize the data with minimal loss of information. We introduce a method that leverages the…
Numerical climate model simulations run at high spatial and temporal resolutions generate massive quantities of data. As our computing capabilities continue to increase, storing all of the data is not sustainable, and thus it is important…
Big data is ubiquitous in practice, and it has also led to a heavy computational burden. To reduce the computational cost and ensure the effectiveness of parameter estimators, an optimal subset sampling method is proposed to estimate the…
A common challenge in estimating parameters of probability density functions is the intractability of the normalizing constant. While in such cases maximum likelihood estimation may be implemented using numerical integration, the approach…
Recently, several studies proposed non-linear transformations, such as a logarithmic or Gaussianization transformation, as efficient tools to recapture information about the (Gaussian) initial conditions. During non-linear evolution, part…
Random projections have become popular tools for processing big data. In particular, when applied to Nonnegative Matrix Factorization (NMF), structured random projections were shown to be far more efficient than classical strategies based on…
Consider a Gaussian memoryless multiple source with $m$ components with joint probability distribution known only to lie in a given class of distributions. A subset of $k \leq m$ components are sampled and compressed with the objective of…
We present a framework for the theoretical analysis of ensembles of low-complexity empirical risk minimisers trained on independent random compressions of high-dimensional data. First we introduce a general distribution-dependent…
The modern practice of Radio Astronomy is characterized by extremes of data volume and rates, principally because of the direct relationship between the signal to noise ratio that can be achieved and the need to Nyquist sample the RF…
As computer resources become increasingly limited, traditional statistical methods face challenges in analyzing massive data, especially in functional data analysis. To address this issue, subsampling offers a viable solution by…
Today, with the growing demands of information storage and data transfer, data compression is becoming increasingly important. Data compression is a technique used to decrease the size of data. This is very useful when some huge…
How much cosmological information can we reliably extract from existing and upcoming large-scale structure observations? Many summary statistics fall short in describing the non-Gaussian nature of the late-time Universe in comparison to…