Statistical Methodology
In propensity score weighted analysis, robust variance that does not account for weight estimation is commonly used. In propensity score weighted Cox models (CoxPSW), the robust variance is known to be conservative when weights for the…
Incorporation of external information into high-dimensional modeling for gene expression data has been shown, both theoretically and empirically, to substantially enhance performance. Such external information, sometimes referred to as…
Forced-choice conjoint designs have become a staple method in the experimentalist's toolkit. However, the forced-choice outcome is neither always consistent with the types of choices individuals make in real political contexts, nor is it…
The integrated conditional moment (ICM) test is a classical and widely used method for assessing the adequacy of regression models. Although it performs well in fixed-dimension settings, its behavior changes dramatically when the predictor…
Principal coordinates analysis (PCoA) is a standard exploratory tool for microbiome beta-diversity studies, but its axes are defined by pairwise dissimilarities and therefore do not directly identify the taxa driving an ordination. We…
This paper proposes a fully Bayesian framework for node-level outlier detection in graph signals, where measurements are observed on the nodes of an underlying graph. Unlike traditional outlier detection methods, our approach accounts for…
Propensity score weighting approaches have been widely implemented in clinical research to estimate the effects of a treatment or exposure while mitigating the risk of confounding in the absence of random assignment. In practice, when…
AI tools increasingly guide targeted interventions in healthcare, education, and recruiting. Algorithms score individuals, trigger outreach to those above a threshold (e.g., high-risk or high-value), and encourage them to request service;…
Online A/B testing at scale relies on proxy metrics -- short-term, easily-measured signals used in place of slow-moving long-term outcomes. When the proxy-outcome relationship is heterogeneous across user segments, aggregate correlation can…
Targeted amplicon panels are widely used in oncology diagnostics, but providing per-gene performance guarantees for copy number variant (CNV) detection remains challenging due to amplification artifacts, process-mismatch heterogeneity, and…
In statistics and machine learning, the traditional meaning of the terms 'outlier' and 'anomaly' is a case in the dataset that behaves differently from the bulk of the data. This raises suspicion that it may belong to a different…
To date, we have seen the emergence of a large literature on multivariate disease mapping. That is, incidence of (or mortality from) multiple diseases is recorded at the scale of areal units where incidence (mortality) across the diseases…
Design-based inference, also known as randomization-based or finite-population inference, provides a principled framework for trustworthy statistical inference by attributing randomness solely to the design mechanism (e.g., treatment…
As generative AI models are increasingly used to simulate real-world systems, quantifying the "sim-to-real" gap is critical. For each input setting of interest -- which we call a scenario, such as a survey question or operating…
Educational disparities are rooted in and perpetuate social inequalities across multiple dimensions such as race, socioeconomic status, and geography. To reduce disparities, most intervention strategies focus on a single domain and…
A suitable scalar metric can help measure multi-calibration, defined as follows. When the expected values of observed responses are equal to corresponding predicted probabilities, the probabilistic predictions are known as "perfectly…
Lung sepsis remains a significant concern in the Northeastern U.S., yet the national eICU Collaborative Database includes only a small number of patients from this region, highlighting underrepresentation. Understanding clinical variables…
Typical causal effects are defined based on the marginal distribution of potential outcomes. However, many real-world applications require causal estimands involving the joint distribution of potential outcomes to enable more nuanced…
In two-way contingency tables under an asymmetric situation, where the row and column variables are defined as explanatory and response variables, respectively, quantifying the extent to which the explanatory variable contributes to…
Randomized controlled trials (RCTs) often suffer from limited inferential efficiency in estimating treatment effects due to their small sample sizes. In recent years, incorporating external controls (ECs) has gained increasing attention as…