Statistical Methodology
In genome-wide association studies (GWASs) based on a case-control design, single nucleotide polymorphisms (SNPs) are typically evaluated for an association test and a Hardy-Weinberg equilibrium (HWE) goodness-of-fit test. SNPs are then…
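For readers unfamiliar with the HWE goodness-of-fit test mentioned in this abstract, the following is a minimal sketch of the textbook one-degree-of-freedom Pearson chi-square version for a single biallelic SNP. It is not the procedure proposed in the paper; the function name `hwe_chi_square` and the genotype-count arguments are hypothetical names chosen for illustration.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Pearson chi-square test of Hardy-Weinberg equilibrium for one SNP.

    Arguments are the observed genotype counts (AA, AB, BB); the names are
    illustrative assumptions. Returns (statistic, p_value) using a
    chi-square reference distribution with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype must be observed")

    # Estimated frequency of allele A from the genotype counts.
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p

    # Expected genotype counts under HWE: n*p^2, 2*n*p*q, n*q^2.
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)

    # Cells with zero expected count (monomorphic SNP) are skipped.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    print(hwe_chi_square(25, 50, 25))  # counts exactly at HWE proportions
    print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all
```

In practice an exact test is often preferred when genotype counts are small, since the chi-square approximation can be poor in that case.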
We propose an extension of the ordered stereotype model (OSM) for ordinal time series data, referred to as the Autoregressive OSM (AR-OSM). The model captures serial dependence by incorporating lagged values of the response as covariates in…
This study proposes coarse-to-fine downscaling (CF-DS), a scalable spatial downscaling method extending coarse-to-fine spatial modeling. Unlike conventional spatial-statistical downscaling methods such as area-to-point kriging, CF-DS does…
Reliable generative AI models critically rely on expert human annotations to evaluate output quality, yet these "gold" labels are expensive to collect and limited in quantity. Organizations thus often turn to collecting vast but noisy…
Classical hypothesis testing frameworks break down in contemporary settings in which null hypotheses are increasingly abstract, the same data are used to both generate and test hypotheses, and minimal assumptions about the underlying data…
Transfer learning leverages knowledge from related source domains to improve learning in a target domain. Recent theoretical advances cover a broad range of regression settings within (generalized) linear models. Despite their diversity,…
Latent class models are central tools for multivariate categorical data from heterogeneous populations, but their standard local-independence assumption is often unrealistic in modern high-dimensional applications. We propose a…
In this paper, we propose a model-based framework to robustify inference for circular data in the presence of anomalous observations, distinguishing between mild and gross anomalies. Starting from a unimodal and symmetric reference model on…
Accurate and scalable land cover classification is essential for global conservation monitoring and policy-making. While remote sensing images provide a cost-effective alternative to ground surveys, current methods often lack principled…
Modern multivariate regression problems involve several related outcomes for which not only are the regression effects nonlinear, heterogeneous, and outcome-specific, but the residual dependence among outcomes is also scientifically…
Learning distributions of longitudinal data is central to tasks such as visualization, completion, classification, and synthetic data generation, but it remains statistically challenging because longitudinal observations are often…
Principal stratification provides a foundational framework for causal inference with intermediate outcomes by defining causal effects within subpopulations, yet existing work has largely focused on average effects across strata rather than…
Standard statistical methods are often inadequate for modeling the joint dependence between linear and circular variables, and existing methods for modeling this dependence are designed only for continuous variables. However, circular data…
We introduce a novel goodness-of-fit (GOF) procedure based on Beta-tree partitions. A Beta-tree produces a data-adaptive partition of the sample space into regions and provides guaranteed finite sample confidence intervals for the…
Time-varying treatment effects, surrogate-identified treatment effects, and mediation effects can all be written as recursive regressions, in which each regression's predicted values become generated outcomes for the next regression. We…
A critical assumption of observational studies is that all confounding variables must be known and sufficiently adjusted for to estimate causal effects. An implicit, and often overlooked, aspect of this assumption is that all confounding…
This paper presents a methodological framework for estimating the comprehensive cohort causal effect (CCCE) in mixed-design clinical studies that combine randomized controlled trials (RCTs) and parallel observational studies (OBS). Our…
[Working Draft] Compositional data are central to microbial, ecological, and environmental research, yet often have four features that are difficult to accommodate jointly: exact zeros, latent dependence among components,…
We present a justification of the use of Inverse Probability Weighting (IPW) in a post-Bayesian framework, in which the bias-correction provided by IPW in a frequentist context is reframed as a reweighting of the Kullback-Leibler (KL)…
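As background for the frequentist bias-correction this abstract refers to, here is a minimal sketch of a self-normalised (Hajek) IPW estimate of a mean when some responses are missing. It illustrates ordinary IPW only, not the post-Bayesian reframing the paper describes. The function name `ipw_mean` and its arguments are hypothetical, and the observation probabilities are assumed to be known.

```python
def ipw_mean(y, observed, propensity):
    """Self-normalised IPW estimate of the population mean of y.

    y          : responses (entries for unobserved units are ignored)
    observed   : booleans, True where the response was observed
    propensity : assumed-known probability that each unit is observed
    All three names are illustrative assumptions.
    """
    num = 0.0
    den = 0.0
    for yi, ri, pi in zip(y, observed, propensity):
        if not ri:
            continue  # unobserved units contribute nothing directly
        if not 0.0 < pi <= 1.0:
            raise ValueError("propensities must lie in (0, 1]")
        w = 1.0 / pi  # observed units stand in for 1/pi units each
        num += w * yi
        den += w
    if den == 0.0:
        raise ValueError("no observed responses")
    return num / den


if __name__ == "__main__":
    # Two units with y = 1 are each observed with probability 0.5 (one is
    # missing); two units with y = 3 are always observed.
    y = [1.0, None, 3.0, 3.0]
    observed = [True, False, True, True]
    propensity = [0.5, 0.5, 1.0, 1.0]
    print(ipw_mean(y, observed, propensity))
```

In this toy example the unweighted mean of the observed responses is 7/3, while the weighted estimate recovers the full-population mean of 2 because the under-sampled group is weighted up.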
Prediction-powered inference (PPI) refers to a two-level situation where the statistician observes a set of $(x,y)$ pairs and another set of $x$s with the responses $y$ missing. Also available is some independent background data from which…