Statistics
We propose a new statistical estimation framework for a large family of global sensitivity analysis indices. Our approach is based on rank statistics and uses an empirical correlation coefficient recently introduced by Chatterjee [9]. We…
Stochastic simulators are increasingly used to expand the frontier of scientific knowledge and inform decision-making across real-world contexts. Simulator calibration, a process by which internal model inputs are tuned to match some…
Randomized clinical trials typically aim to estimate a marginal treatment effect. While covariate adjustment can improve precision, it may change the estimand in nonlinear models due to noncollapsibility, leading to conditional rather than…
External validation of clinical prediction models is crucial for assessing whether they are fit for use. The $C$-statistic is a widely used measure of discriminative performance of such models predicting a binary outcome. A method for…
Traditional neural networks provide deterministic predictions without inherent uncertainty estimates. While Bayesian Neural Networks (BNNs) offer a principled approach to uncertainty quantification, their computational complexity limits…
Estimating the probabilities of rare failure events is a key challenge in the reliability analysis of physical systems. Subset simulation (SS) is a very popular adaptive Monte Carlo method for this problem. In SS, the small failure…
Quantitative practice across statistics, engineering, and machine learning has been transformed by the automation of inference. Predictions are produced, validated, and deployed at scale and speed that human-mediated reasoning could not…
We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense…
Directed acyclic graphs (DAGs) constitute a central modeling tool to enable principled reasoning about cause-effect interactions in complex systems. However, since the causal structure underlying a group of variables is often unknown and…
Rank regression offers robustness to outliers and heavy-tailed response distributions, invariance to monotonic transformations, and improved efficiency under non-Gaussian errors, making it a versatile tool for analyzing complex data. This…
In many prediction problems, we have extra information during training (for example, measurements that are expensive or slow to collect) that will not be available when the model is deployed. A common strategy is to first train a model that…
Applications of artificial intelligence (AI) in drug development continue to increase at a rapid pace. Regulatory authorities have provided increasingly clear perspectives on the use of AI in regulated applications, including recent draft…
Meta-analyses of two-group studies that report median differences typically rely on methods that require, in addition to the median difference and sample size, summary measures of dispersion such as quartiles or ranges. Studies that do not…
Data represented as covariance-type matrices arise in many fields, including brain functional connectivity and diffusion tensor imaging. We develop the MFM-Wishart, a Bayesian model-based clustering approach for such data that combines…
Individual fairness, the notion that "similar individuals should be treated similarly," provides a strong and flexible fairness guarantee for algorithmic decision makers. However, a barrier to implementing individual fairness in practice is…
Large language models (LLMs) offer a scalable mechanism to elicit domain-informed prior information for high-dimensional variable selection. However, existing methods such as LLM-Lasso are sensitive to weight quality, with performance…
Multi-fidelity Monte Carlo (MFMC) is a variance reduction method that leverages a multi-fidelity ensemble of models of varying cost and accuracy levels. Constructing an MFMC estimator with optimal variance requires knowledge of the…
Score matching is an alternative to maximum likelihood estimation when the normalizing constant is unknown or too costly to evaluate. However, vanilla score matching has shown to be inefficient relative to maximum likelihood estimation for…
Inferring racial discrimination in police use of force -- the average causal effect of civilian race on use of force -- requires two assumptions about policing prior to potential use of force: that officers do not discriminate in whom they…
In the era of precision medicine, genome-wide epigenetic modifications offer rich data that could inform risk prediction. However, these data are high-dimensional and exhibit complex dependence structures, which makes it difficult to…