Statistics
Multicollinearity is a long-standing challenge in observational causal inference, especially in regressions -- highly correlated independent variables make it hard to isolate their individual effects on outcomes of interest. While common…
Resampling-based simultaneous confidence bands for cumulative hazard functions often undercover in finite samples with right censoring. We study two aspects of the construction that can contribute to this gap, the resampling scheme and the…
Model selection and hypothesis testing are important tasks on networks. A key challenge lies in the inherent dependence in network data, as well as the fact that typically only a single realization is observed. As a result, many existing…
Epidemiologists increasingly use machine learning to adjust for high-dimensional confounding. Augmented inverse probability weighting (AIPW) and targeted maximum likelihood estimation (TMLE) are most widely used but may yield different…
With the emergence of various tensor data, tensor completion from partial measurements has attracted widespread attention in data science and signal processing. Total Variation (TV) has been widely used as an effective regularization…
Modern deep learning has been shown to operate at the edge of stability, routinely using learning rates far larger than those justified by classical optimization theory. Most prior analyses of the edge of stability phenomenon focus on…
Cross-fitting is not a refinement of survey-weighted causal machine learning but, once the nuisances are flexible, what restores valid inference. We study the population average treatment effect under a stratified multistage design,…
Alternating recurrent events -- event-times of a specific nature that trigger a secondary refractory period -- occur in a wide range of fields, including behavioral science, criminal justice, and biostatistics. Analysis of these events…
In this paper, we attempt to enhance the theoretical understanding of convolutional neural networks (CNNs) as feature extractors in classification tasks by analyzing them through the lens of Cover's function-counting theory. Specifically,…
The U.S. Census Bureau's Low Response Score (LRS) is a central planning instrument for identifying places likely to require additional self-response outreach and nonresponse follow-up. The published LRS is intentionally interpretable: it…
Contrastive embedding models trained with scale-invariant losses are typically paired with distance metrics like cosine similarity, effectively ignoring embedding magnitudes. However, surprisingly, empirical studies reveal that despite…
Modern statistical learning problems often involve multiple related data sets, where learning efficiency on a target set can be improved by utilizing related source sets, while heterogeneity among the source sets may introduce bias.…
We propose doubly robust adaptive conformal inference (DR-ACI), which constructs prediction intervals for doubly robust pseudo-outcomes under temporal dependence.
Normalizing Flows excel at modeling a single fixed density, yet many problems across the sciences, such as high energy physics, instead require modeling how that density deforms as a function of continuous parameters: the strength of a…
We consider sparse multivariate stochastic systems that evolve in continuous time according to a causal mechanism and present methodology to recover the system's time-infinitesimal transition mechanism from mere cross-sectional data. This…
Neural networks are known to be susceptible to over-reliance on spurious correlations. However, the precise mechanism by which models exploit shortcut features is not fully understood, and algorithms to mitigate this behavior rely on as yet…
In numerous scientific and industrial settings, observed multivariate time series are nonstationary, i.e., their second-order properties vary over time. An additional feature of many modern datasets is that the…
Delayed generalization (i.e., grokking) refers to the phenomenon in which a neural network fits its training data early in training but only begins to generalize after a prolonged delay, often through an abrupt transition. Despite extensive…
Rapid prototyping of algorithms is a critical step in modern machine learning. Most algorithms exploit linear algebra, creating a need for lightweight numerical routines which -- while potentially sub-optimal for the task at hand -- can be…
In genome-wide association studies (GWASs) based on a case-control design, single nucleotide polymorphisms (SNPs) are typically evaluated for an association test and a Hardy-Weinberg equilibrium (HWE) goodness-of-fit test. SNPs are then…