Statistics Theory
Logistic regression is a classical model for describing the probabilistic dependence of binary responses on multivariate covariates. We consider the predictive performance of the maximum likelihood estimator (MLE) for logistic regression,…
Estimating nonlinear functionals of probability distributions from samples is a fundamental statistical problem. The "plug-in" estimator obtained by applying the target functional to the empirical distribution of samples is biased.…
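To make the bias of the plug-in estimator concrete, here is a minimal simulation sketch. It assumes Shannon entropy as the nonlinear functional and a uniform distribution on 8 symbols; both choices, and the helper names `plugin_entropy` and `mean_plugin_entropy`, are illustrative assumptions and are not taken from the abstract.

```python
import math
import random
from collections import Counter


def plugin_entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution of `samples`."""
    n = len(samples)
    counts = Counter(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def mean_plugin_entropy(k, n, trials, seed=0):
    """Average the plug-in entropy over repeated samples of size `n`
    drawn uniformly from `k` symbols (hypothetical simulation setup)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        samples = [rng.randrange(k) for _ in range(n)]
        total += plugin_entropy(samples)
    return total / trials


true_entropy = math.log(8)  # entropy of the uniform law on 8 symbols
avg_estimate = mean_plugin_entropy(k=8, n=20, trials=2000)
print(f"true entropy:        {true_entropy:.4f}")
print(f"mean plug-in value:  {avg_estimate:.4f}")
```

In this setup every plug-in value is at most the log of the number of distinct observed symbols, so the average falls below the true entropy: the estimator is biased downward at small sample sizes.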
Generalized Sliced Inverse Regression (GSIR) is one of the most important methods for nonlinear sufficient dimension reduction. As shown in Li and Song (2017), it enjoys a convergence rate that is independent of the dimension of the…
Sparse recovery is among the most well-studied problems in learning theory and high-dimensional statistics. In this work, we investigate the statistical and computational landscapes of sparse recovery with $\ell_\infty$ error guarantees.…
We study deformations of the geodesic distances on a domain of $\mathbb{R}^N$ induced by a function called the conformal factor. We show that under a positive reach assumption on the domain (not necessarily a submanifold) and mild assumptions on the…
Extreme value distributions are routinely employed to assess risks connected to extreme events in a large number of applications. They are typically two- or three-parameter distributions: the inference can be unstable, which is…
We study nonparametric maximum likelihood estimation of probability densities under a total variation (TV) type penalty, the sectional variation norm (also known as the Hardy-Krause variation). TV regularization has a long history in regression and…
We study the problem of nonparametric estimation of the linear multiplier function $\theta(t)$ for processes satisfying stochastic differential equations of the type $$dX_t=\theta(t) X_tdt+ \epsilon dZ^{q,H}_t, X_0=x_0, 0\leq t \leq T$$…
U-statistics are a fundamental class of estimators that generalize the sample mean and underpin much of nonparametric statistics. Although extensively studied in both statistics and probability, key challenges remain: their high…
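As a small illustration of how a U-statistic generalizes the sample mean, the sketch below averages a symmetric kernel over all subsets of a given size. The function names and the data are illustrative assumptions; the brute-force enumeration is only meant to show the definition and is not the method of the paper.

```python
from itertools import combinations
import statistics


def u_statistic(data, kernel, order=2):
    """Average a symmetric kernel over all size-`order` subsets of `data`."""
    values = [kernel(*subset) for subset in combinations(data, order)]
    return sum(values) / len(values)


def variance_kernel(x, y):
    """Order-2 kernel whose U-statistic is the unbiased sample variance."""
    return 0.5 * (x - y) ** 2


data = [2.0, 4.0, 4.0, 5.0, 7.0]  # hypothetical sample

# Order 1 with the identity kernel recovers the sample mean.
u_mean = u_statistic(data, lambda x: x, order=1)

# Order 2 with the variance kernel recovers the unbiased sample variance.
u_var = u_statistic(data, variance_kernel, order=2)

print(u_mean, statistics.mean(data))
print(u_var, statistics.variance(data))
```

The enumeration visits every subset of size `order`, so the cost grows like n choose `order`; that is acceptable for a toy sample but quickly becomes expensive for large samples or higher orders.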
The geometric median, a notion of center for multivariate distributions, has gained recent attention in robust statistics and machine learning. Although conceptually distinct from the mean (i.e., expectation), we demonstrate that both are…
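For readers unfamiliar with the geometric median, the following sketch computes it with Weiszfeld's iteration, a standard fixed-point scheme that is swapped in here purely for illustration and is not claimed to be the approach of the paper. The function name, iteration count, and the `eps` guard against division by zero are assumptions.

```python
import math


def geometric_median(points, iters=200, eps=1e-12):
    """Approximate the point minimizing the sum of Euclidean distances
    to `points`, using Weiszfeld's reweighted-average iteration."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    est = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(iters):
        # Each point is weighted by the inverse of its distance to the estimate;
        # `eps` avoids dividing by zero when the estimate lands on a data point.
        weights = [1.0 / max(math.dist(p, est), eps) for p in points]
        total = sum(weights)
        est = [
            sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(dim)
        ]
    return est


# Hypothetical data: two nearby points and one far outlier.
points = [(0.0, 0.0), (1.0, 0.0), (100.0, 0.0)]
mean = [sum(p[j] for p in points) / len(points) for j in range(2)]
median = geometric_median(points)
print("mean:            ", mean)
print("geometric median:", median)
```

On this sample the mean is dragged toward the outlier, while the geometric median stays at the middle point, which is the robustness property that motivates its use.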
The coalescent is a foundational model of latent genealogical trees under neutral evolution, but suffers from intractable sampling probabilities. Methods for approximating these sampling probabilities either introduce bias or fail to scale…
Conformal inference is a versatile tool for building prediction sets in regression or classification. We study the false coverage proportion (FCP) in a simultaneous inference setting with a calibration sample of $n$ points and a test sample…
Sampling from discrete distributions is a ubiquitous task in machine learning, recently revisited by the emergence of discrete diffusion models. While Langevin algorithms constitute the state of the art for continuous spaces, discrete…
We study the problem of detecting a planted star in the Erdős–Rényi random graph $G(n,m)$, formulated as a hypothesis test. We determine the scaling window for critical detection in $m$ in terms of the star size, and characterize…
We develop a unified framework for goodness-of-fit (GOF) testing through the lens of Bayes risk. Classical GOF procedures are commonly calibrated either at fixed significance level (CLT scale) or through exponential error exponents (LDP…
Changepoint localization aims to provide confidence sets for a changepoint (if one exists). Existing methods either rely on strong parametric assumptions, provide only asymptotic guarantees, or focus on a particular kind of…
A common approach to synthetic data is to sample from a fitted model. We show that under general assumptions, this approach yields a sample whose estimators are inefficient and whose joint distribution is inconsistent with the true…
Bayesian methods are often optimal, yet increasing pressure for fast computations, especially with streaming data, brings renewed interest in faster, possibly sub-optimal, solutions. The extent to which these algorithms approximate Bayesian…
Given data $\{({\boldsymbol x}_i,y_i): i\le n\}$, with ${\boldsymbol x}_i$ standard $d$-dimensional Gaussian feature vectors, and $y_i\in{\mathbb R}$ response variables, we study the general problem of learning a model parametrized by…
We study Gaussian Process Thompson Sampling (GP-TS) for sequential decision-making over compact, continuous action spaces and provide a frequentist regret analysis based on fractional Gaussian process posteriors, without relying on domain…