Related papers: Computationally Efficient Estimators for Dimension…
In recent years, large high-dimensional data sets have become commonplace in a wide range of applications in science and commerce. Techniques for dimension reduction are of primary concern in statistical analysis. Projection methods play an…
Compressed Counting (CC) was recently proposed for very efficiently computing the (approximate) $\alpha$th frequency moments of data streams, where $0<\alpha <= 2$. Several estimators were reported including the geometric mean estimator,…
We develop efficient binary (i.e., 1-bit) and multi-bit coding schemes for estimating the scale parameter of $\alpha$-stable distributions. The work is motivated by the recent work on one scan 1-bit compressed sensing (sparse signal…
This paper will focus on three different aspects in improving the current practice of stable random projections. Firstly, we propose {\em very sparse stable random projections} to significantly reduce the processing and storage cost, by…
Analyzing high-dimensional data with manifold learning algorithms often requires searching for the nearest neighbors of all observations. This presents a computational bottleneck in statistical manifold learning when observations of…
Dimension reduction is often an important step in the analysis of high-dimensional data. PCA is a popular technique to find the best low-dimensional approximation of high-dimensional data. However, classical PCA is very sensitive to…
Designing scalable estimation algorithms is a core challenge in modern statistics. Here we introduce a framework to address this challenge based on parallel approximants, which yields estimators with provable properties that operate on the…
We provide a simple method and relevant theoretical analysis for efficiently estimating higher-order lp distances. While the analysis mainly focuses on l4, our methodology extends naturally to p = 6,8,10..., (i.e., when p is even).…
It is preferred that feature selectors be \textit{stable} for better interpretabity and robust prediction. Ensembling is known to be effective for improving the stability of feature selectors. Since ensembling is time-consuming, it is…
We consider the question of learning the natural parameters of a $k$ parameter minimal exponential family from i.i.d. samples in a computationally and statistically efficient manner. We focus on the setting where the support as well as the…
We study the use of "sign $\alpha$-stable random projections" (where $0<\alpha\leq 2$) for building basic data processing tools in the context of large-scale machine learning applications (e.g., classification, regression, clustering, and…
We propose a new estimator based on a linear programming method for smooth frontiers of sample points. The derivative of the frontier function is supposed to be Holder continuous.The estimator is defined as a linear combination of kernel…
We introduce sparse random projection, an important dimension-reduction tool from machine learning, for the estimation of discrete-choice models with high-dimensional choice sets. Initially, high-dimensional data are compressed into a…
This paper addresses the estimation of locally stationary long-range dependent processes, a methodology that allows the statistical analysis of time series data exhibiting both nonstationarity and strong dependency. A time-varying…
Large language model (LLM) training is often bottlenecked by memory constraints and stochastic gradient noise in extremely high-dimensional parameter spaces. Motivated by empirical evidence that many LLM gradient matrices are effectively…
For many tasks of data analysis, we may only have the information of the explanatory variable and the evaluation of the response values are quite expensive. While it is impractical or too costly to obtain the responses of all units, a…
Algorithmic stability is a central concept in statistics and learning theory that measures how sensitive an algorithm's output is to small changes in the training data. Stability plays a crucial role in understanding generalization,…
Distance queries are a basic tool in data analysis. They are used for detection and localization of change for the purpose of anomaly detection, monitoring, or planning. Distance queries are particularly useful when data sets such as…
Subsampling is an efficient method to deal with massive data. In this paper, we investigate the optimal subsampling for linear quantile regression when the covariates are functions. The asymptotic distribution of the subsampling estimator…
Many applications using large datasets require efficient methods for minimizing a proximable convex function subject to satisfying a set of linear constraints within a specified tolerance. For this task, we present a proximal projection…