Statistics
CART random forests are among the most widely used modern predictive methods, with well-documented empirical success. Yet, at the mechanistic level, the algorithm is often treated as a black box because of its complexity. In this paper, we…
We propose KO-PDE-IDENT, a data-driven framework for identifying parsimonious partial differential equations (PDEs) with false discovery rate (FDR) control. PDE discovery from noisy observations is often hindered by extreme…
We introduce the Hyperedge-triggered Hawkes (HTH) process for inferring higher-order interaction structure in multi-cellular systems from asynchronous event-time data. Beyond standard pairwise excitation, the HTH intensity includes a term…
Dual system estimation (DSE) is heavily used in Census Bureau operations. With DSE, it is important to have methods for inferring the population size among those with missing data from one or both data sources. The use of…
Control charts for process monitoring are widely used in practice. Most control charts require the monitored (residual) process to be serially independent (and to satisfy specified distributional assumptions), whereas undetected dependence…
Statistical procedures rarely retain all features of the observed data. A sufficient statistic removes information irrelevant to a parameter; a maximum likelihood estimate compresses an empirical objective into an optimizing point; and a…
Individualized randomized experiments are central to online platforms for optimizing personalized decisions in complex environments. In two-sided markets, however, standard treatment effect estimation is often invalid due to strong temporal…
This paper studies causal discovery for a directed acyclic graph under a structural equation model with additive heteroscedastic errors. We first establish new identifiability results for location-scale noise models, showing that…
Win statistics, including the win ratio, net benefit, and win odds, summarize treatment effects on hierarchical composite endpoints by sequentially comparing patient pairs on component outcomes ordered by clinical importance, proceeding to…
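For readers unfamiliar with the three win statistics named above, the following is a minimal sketch of their standard textbook definitions, not the method of the paper. It assumes fully observed outcomes with no censoring, and that a pair moves to the next component only when it is tied on the current one. The function names `compare_pair` and `win_statistics` and the toy data are hypothetical.

```python
from itertools import product


def compare_pair(treated, control, higher_is_better):
    """Compare one treated/control pair on components ordered by importance.

    Returns 1 if the treated patient wins, -1 if the control patient wins,
    and 0 if the pair is tied on every component.
    """
    for t, c, hib in zip(treated, control, higher_is_better):
        if t == c:
            continue  # tied on this component: fall through to the next one
        treated_better = (t > c) if hib else (t < c)
        return 1 if treated_better else -1
    return 0


def win_statistics(treated_group, control_group, higher_is_better):
    """Compute win ratio, net benefit, and win odds over all cross-arm pairs."""
    wins = losses = ties = 0
    for t, c in product(treated_group, control_group):
        result = compare_pair(t, c, higher_is_better)
        if result > 0:
            wins += 1
        elif result < 0:
            losses += 1
        else:
            ties += 1
    total = wins + losses + ties
    odds_denominator = losses + 0.5 * ties
    return {
        "win_ratio": wins / losses if losses else float("inf"),
        "net_benefit": (wins - losses) / total,
        "win_odds": (wins + 0.5 * ties) / odds_denominator
        if odds_denominator
        else float("inf"),
    }


# Toy data (assumed): each patient is (alive indicator, symptom score),
# with higher values better on both components.
treated_group = [(1, 5), (1, 3)]
control_group = [(1, 4), (0, 9)]
stats = win_statistics(treated_group, control_group, (True, True))
print(stats)
```

In this toy example the treated arm wins three of the four pairs and loses one, with no ties, so the win ratio and win odds coincide. They differ only when some pairs remain tied after all components have been compared.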
This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal methods rely on joint exchangeability, making it difficult to incorporate auxiliary information such as…
Estimating equations arise in a wide range of statistical applications, including longitudinal and clustered data analysis, survival analysis, econometrics, and semiparametric inference. In high-dimensional settings, adding…
Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold standard for causal inference in many applied fields. However, RCTs are costly, time-consuming, and…
Small-area precipitation forecasts support real-time decisions for reservoir operation, irrigation planning, drought monitoring, and flash-flood response. Operational value depends not only on point accuracy, but also on calibrated…
A representation that scrambles the true degrees of freedom of the world cannot support reliable planning or compositional generalization. We prove that LeJEPA (alignment plus Gaussian regularization) linearly recovers the world's latent…
Analyses of recurrent hypoglycemia are critical for effective treatment management in diabetic patients. Typically, within-subject dependency in such analyses is captured through subject-level frailty. Recent research has modeled recurrent…
We propose a Bayesian latent variable model to estimate covariate-assisted dependence structures across multiple modalities of multivariate data that may be observed asynchronously. This setting commonly arises in longitudinal biomedical…
When treatment effects are naturally expressed as ratios -- as in medicine, pricing, and marketing -- the ratio-based CATE $\tau(x) = E[Y|W=1,X=x] / E[Y|W=0,X=x]$ is the appropriate estimand. Yet existing estimators either impose a…
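To make the estimand concrete, here is a naive plug-in sketch of $\tau(x)$ for a discrete covariate: the ratio of treated to control sample means within each stratum. This is an illustration only, not the estimator proposed in the paper. It assumes treatment is as good as randomly assigned within each stratum, and the function name `ratio_cate_by_stratum` and the toy rows are hypothetical.

```python
from collections import defaultdict


def ratio_cate_by_stratum(rows):
    """Plug-in estimate of E[Y|W=1,X=x] / E[Y|W=0,X=x] for discrete x.

    rows: iterable of (x, w, y) with w in {0, 1}.
    Strata with an empty arm or a zero control mean are skipped,
    because the ratio is undefined there.
    """
    sums = defaultdict(lambda: [0.0, 0.0])   # per stratum: [sum of y for w=0, w=1]
    counts = defaultdict(lambda: [0, 0])     # per stratum: [n for w=0, w=1]
    for x, w, y in rows:
        sums[x][w] += y
        counts[x][w] += 1

    tau = {}
    for x in sums:
        n0, n1 = counts[x]
        if n0 == 0 or n1 == 0:
            continue
        mean0 = sums[x][0] / n0
        mean1 = sums[x][1] / n1
        if mean0 == 0:
            continue
        tau[x] = mean1 / mean0
    return tau


# Toy data (assumed): (stratum, treatment indicator, outcome).
rows = [
    ("a", 1, 4.0), ("a", 1, 6.0), ("a", 0, 2.0), ("a", 0, 3.0),
    ("b", 1, 1.0), ("b", 0, 2.0),
]
tau = ratio_cate_by_stratum(rows)
print(tau)
```

A ratio above 1 indicates that treatment scales the mean outcome up in that stratum, and a ratio below 1 indicates that it scales it down. The plug-in form becomes unstable when the control mean is near zero.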
We study a nonlinear factor model in which observed responses depend on low-rank latent factors through an unknown monotone link function. This setting is challenging and largely underexplored due to severe nonconvexity and identifiability…
Length-biased distributions arise naturally in environmental, reliability, and economic studies where the sampling mechanism favors larger observational units. In this paper, we propose a quantile regression model based on the length-biased…
While Conformal Prediction (CP) has proven to be a powerful framework for uncertainty quantification, guaranteeing conditional coverage remains a central challenge. Although finite-sample, distribution-free conditional validity is known to…