Statistics
Simulating realistic wet and dry spells is central to weather generators and climate-impact studies. While finite-order Markov chains are standard, they often fail to reproduce persistent dry conditions due to their inherent subexponential…
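For context on the Markov-chain baseline mentioned above, a first-order two-state chain produces geometrically distributed spell lengths, so the mean dry spell is 1 / (1 - P(dry | dry)). The following is a minimal sketch of that baseline only, not the method proposed in the entry; the function names and the transition probabilities (0.8 and 0.6) are illustrative assumptions.

```python
import random


def simulate_occurrence(n_days, p_dry_to_dry, p_wet_to_wet, seed=0):
    """Simulate a first-order two-state chain (0 = dry, 1 = wet)."""
    rng = random.Random(seed)
    state = 0
    states = []
    for _ in range(n_days):
        states.append(state)
        stay = p_dry_to_dry if state == 0 else p_wet_to_wet
        if rng.random() >= stay:
            state = 1 - state  # leave the current state
    return states


def spell_lengths(states, target):
    """Lengths of completed runs of `target`; a run still open at the end is dropped."""
    lengths = []
    run = 0
    for s in states:
        if s == target:
            run += 1
        elif run:
            lengths.append(run)
            run = 0
    return lengths


# Hypothetical transition probabilities, chosen only for illustration.
P_DRY_TO_DRY = 0.8
P_WET_TO_WET = 0.6

states = simulate_occurrence(200_000, P_DRY_TO_DRY, P_WET_TO_WET)
dry_spells = spell_lengths(states, target=0)
mean_dry = sum(dry_spells) / len(dry_spells)
expected_mean = 1.0 / (1.0 - P_DRY_TO_DRY)  # mean of a geometric spell length
```

Under these assumptions the simulated mean dry-spell length should sit close to `expected_mean`, and the probability of a spell lasting at least k days decays like `P_DRY_TO_DRY ** (k - 1)`.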
Latent Gaussian models (LGMs) are a popular class of Bayesian hierarchical models that include Gaussian processes, as well as certain spatial models and mixed-effect models. Efficient Bayesian inference of LGMs often requires marginalizing…
This work addresses the challenges of robust covariance estimation and interpretable outlier detection for multivariate functional data with separable covariance structure. We develop a method that simultaneously improves robustness and…
We develop a rigorous statistical theory of multi-head attention (MHA) as an ensemble of Nadaraya-Watson (NW) kernel regression estimators. Building on the algebraic identity between single-head softmax attention and the NW estimator, we…
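The algebraic identity referred to above can be checked numerically: for one query, softmax attention is a weighted average of the values, and the weights are exactly those of a Nadaraya-Watson estimator with an exponential kernel on the query-key inner product. The sketch below is an illustration under stated assumptions, not the entry's construction; the 1/sqrt(d) scaling, the function names, and the random test data are all assumptions.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax_attention(query, keys, values):
    """Single-head softmax attention output for one query vector."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def nadaraya_watson(query, keys, values):
    """NW estimator with kernel K(q, x) = exp(q . x / sqrt(d))."""
    scale = math.sqrt(len(query))
    kernel = [math.exp(dot(query, key) / scale) for key in keys]
    denom = sum(kernel)
    dim = len(values[0])
    return [sum(k * v[j] for k, v in zip(kernel, values)) / denom for j in range(dim)]


random.seed(0)
query = [random.gauss(0, 1) for _ in range(4)]
keys = [[random.gauss(0, 1) for _ in range(4)] for _ in range(10)]
values = [[random.gauss(0, 1) for _ in range(3)] for _ in range(10)]

attn_out = softmax_attention(query, keys, values)
nw_out = nadaraya_watson(query, keys, values)
```

The two outputs agree up to floating-point error, because the max-subtraction in the softmax cancels between numerator and denominator.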
Background: Days Alive and at Home (DAH) over a pre-defined follow-up period is a novel post-intervention composite outcome that combines data from at least three components: (i) initial length of hospital stay, (ii) length of total…
We propose a new regularized optimal transport (OT) formulation, termed sliced-regularized optimal transport (SROT). Unlike entropic OT (EOT), which regularizes the transport plan toward an independent coupling, SROT regularizes it toward a…
We analyze the filing-side legal infrastructure of eviction using 755,004 Philadelphia Municipal Court landlord-tenant records filed between 1969 and 2022, of which 747,125 are residential. Eviction in Philadelphia is organized upstream by…
Traditional step-stress accelerated life testing (SSALT) models assume that test units originate from a homogeneous population. Recently, Lu and Kateri (2025) proposed a heterogeneous cumulative-exposure-based SSALT model to account for the…
Net benefit is widely used and reported to evaluate the clinical utility of prediction models, yet its interpretation often remains difficult in practice. In this didactic note, we develop two complementary interpretations that make net…
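As a reminder of the quantity being interpreted, net benefit at a risk threshold p_t is commonly computed as TP/n - (FP/n) * p_t / (1 - p_t). The sketch below implements that standard decision-curve formula; it does not reproduce the two interpretations developed in the entry, and the function name and the toy outcomes and predicted risks are made up for illustration.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    odds = threshold / (1.0 - threshold)  # harm-to-benefit exchange rate
    return true_pos / n - (false_pos / n) * odds


# Toy data (illustrative only): observed outcomes and predicted risks.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.6, 0.05, 0.15, 0.8]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
# "Treat all" reference strategy: every patient is classified as positive.
nb_treat_all = net_benefit(y_true, [1.0] * len(y_true), threshold=0.5)
```

With these toy numbers the model has 3 true positives and 1 false positive out of 10 patients, so at a threshold of 0.5 its net benefit is 0.3 - 0.1 = 0.2, against -0.2 for treating everyone.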
While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images…
Many statistical problems can be addressed by applying a multiple testing procedure (MTP) that controls either the Family-wise Error Rate (FWER) or False Discovery Rate (FDR) under unknown arbitrarily interdependent $p$-values, without…
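Two classical procedures that remain valid under arbitrary dependence are Holm's step-down method (FWER) and the Benjamini-Yekutieli step-up method (FDR). The sketch below shows both as familiar reference points; they are not claimed to be the procedures studied in the entry, and the function names and example $p$-values are illustrative assumptions.

```python
def holm_rejections(pvalues, alpha=0.05):
    """Holm step-down: indices of hypotheses rejected with FWER <= alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = []
    for rank, idx in enumerate(order):  # rank is 0-based
        if pvalues[idx] <= alpha / (m - rank):
            rejected.append(idx)
        else:
            break  # stop at the first non-rejection
    return sorted(rejected)


def benjamini_yekutieli_rejections(pvalues, alpha=0.05):
    """Benjamini-Yekutieli step-up: indices rejected with FDR <= alpha."""
    m = len(pvalues)
    harmonic = sum(1.0 / j for j in range(1, m + 1))  # dependence correction
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest_k = 0
    for k, idx in enumerate(order, start=1):
        if pvalues[idx] <= k * alpha / (m * harmonic):
            largest_k = k  # keep the largest k that passes
    return sorted(order[:largest_k])


# Illustrative p-values, deliberately unsorted.
pvals = [0.03, 0.001, 0.5, 0.004, 0.002, 0.003]
holm_set = holm_rejections(pvals)
by_set = benjamini_yekutieli_rejections(pvals)
```

Both functions return indices into the original, unsorted list, so results can be mapped back to the hypotheses directly.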
Understanding interaction effects among variables is important for regression modeling in various applications. The conventional approach of quantifying interactions as the product of variables often lacks clear interpretability, especially…
OBJECTIVE: To propose time-to-event estimators that help evaluate incident diagnostic coding and possible upcoding in Medicare as well as introduce an open-source software package that enables more reproducible methods development relevant…
We study various types of consistency of honest decision trees and random forests in the regression setting. In contrast to related literature, our proofs are elementary and follow the classical arguments used for smoothing methods. Under…
In developing data-driven modeling methodologies, there is an ongoing need to reconcile the strong predictive performance of opaque black-box models with the transparency required for critical applications. This work introduces an…
Two-phase sampling offers a cost-effective way to validate error-prone covariate measurements in biomedical databases. Inexpensive or easy-to-obtain information is collected for the entire study in Phase I. Then, a subset of patients…
Determining whether vaccine efficacy wanes is important for individual and public decision making. Yet, quantification of waning is a subtle task. The classical approaches cannot be interpreted as measures of declining efficacy unless we…
Variational inference (VI) is a cornerstone of modern Bayesian learning, enabling approximate inference in complex models. However, its formulation depends on expectations and divergences defined through high-dimensional integrals, often…
This paper investigates a recursive formulation of auto-regressive multi-fidelity Gaussian process regression in the challenging setting of noisy and non-nested high- and low-fidelity data. We propose a decoupled optimization strategy based…
Detecting multimodality in empirical distributions is a fundamental problem in statistics and data analysis, with applications ranging from clustering to the study of complex systems. In practice, however, assessing departures from…