统计方法学
Missing data are ubiquitous in public health research. When estimating causal effects, there are well-established methods to address bias to due missing outcomes. Commonly, causal estimands are defined under hypothetical interventions to…
Dynamical systems in biology are complex, and one often does not have comprehensive knowledge about the interactions involved. Chemical reaction network (CRN) inference aims to identify, from observing species concentrations over time, the…
Maximum likelihood estimation in nonlinear models can exhibit substantial instability in finite samples when the data provide limited information about certain parameters. Such instability is driven by rare but extreme realizations of the…
Understanding treatment effect heterogeneity is important for decision making in medical and clinical practices, or handling various engineering and marketing challenges. When dealing with high-dimensional covariates or when the effect…
Inverse probability (IP) weighting of marginal structural models (MSMs) can provide consistent estimators of time-varying treatment effects under correct model specifications and identifiability assumptions, even in the presence of…
In this paper, we estimate the sparse dependence structure in the tail region of a multivariate random vector, potentially of high dimension. The tail dependence is modeled via a graphical model for extremes embedded in the H\"usler-Reiss…
The paper considers parameter estimation in count data models using penalized likelihood methods. The motivating data consists of multiple independent count variables with a moderate sample size per variable. The data were collected during…
We study the problem of identifying change points in high-dimensional generalized linear models, and propose an approach based on sample-weighted empirical risk minimization. Our method, Weighted ERM, encodes priors on the change points via…
We consider the problem of clustering nested or hierarchical data, where observations are grouped and there are both group-level and observation-level variables. In our motivating OneK1K dataset, observations consist of single-cell…
Heterogeneous network data with rich nodal information become increasingly prevalent across multidisciplinary research, yet accurately modeling complex nodal heterogeneity and simultaneously selecting influential nodal attributes remains an…
We develop a novel reference prior for Gaussian hierarchical models with intrinsic conditional autoregressive (ICAR) random effects. This is particularly important in the context of objective Bayes variable selection with sample size $n$…
High-dimensional inference methods often rely on coefficient sparsity, an assumption that can be restrictive when signals are dense but individually weak. In such settings, valid inference may still be possible if the covariates exhibit…
Quantifying the similarity of two or more datasets has widespread applications in statistics and machine learning. The method choice is, however, difficult due to the abundance of proposed methods and the lack of neutral comparison studies,…
Long questionnaires increase the response burden for patients and healthcare workers. In the treatment of Parkinson's disease, the MDS-UPDRS questionnaire to track disease progression may be underutilized due to time requirements. While…
There is rising interest in using Machine Learning (ML) model predictions as outcomes in causal analysis. However, these methods have faced challenges in finding the true treatment effects. It is also challenging to make choices about which…
Simultaneously testing $K$ hypotheses while controlling the family-wise error rate is a fundamental problem in statistics. Existing procedures (Bonferroni, Holm, Hochberg, Hommel) provide valid control but sacrifice power, increasingly so…
Inferring directed acyclic graphs (DAGs) from data via Markov chain Monte Carlo (MCMC) is computationally challenging in moderate-to-high dimensional settings because their discrete sampling space grows super-exponentially with the number…
An individualized treatment rule (ITR) tailors treatments to a patient's specific characteristics. However, randomized controlled trials (RCTs) are often underpowered to detect the treatment effect heterogeneity needed for reliable ITR…
Causal mediation analysis in cluster-randomized trials (CRTs) is complicated by the presence of multiple mediators, intracluster correlation, and within-cluster interference. Existing mediation methods often fall short in accommodating…
Missing values in electronic health record (EHR) data pose a significant challenge for epidemiologic research. Traditional methods for handling missing data, like mean imputation, may introduce bias. Multiple imputation (MI) offers a…