Statistical Methodology
Sample overlap is a common issue in evidence synthesis in medical research, particularly when integrating findings from observational studies that use existing databases such as registries. Due to the general inaccessibility…
An index of an effective number of variables (ENV) is introduced for model selection in nested models. This is the case, for instance, when we have to decide the order of a polynomial function or the number of bases in a nonlinear…
Recently a new experimental approach, the hybrid experimental design (HED), was introduced to enable investigators to answer scientific questions about building behavioral interventions in which human-delivered and digital components are…
Choi and Yuan (2025) propose a novel approach to applying matrix completion to the problem of estimating causal effects in panel data. The key insight is that even in the presence of structured patterns of missing data -- i.e. selection…
Incorporating external data can improve the efficiency of clinical trials, but distributional mismatches between current and external populations threaten the validity of inference. While numerous dynamic borrowing methods exist, the…
Quantifying distributional separation across groups is fundamental in statistical learning and scientific discovery, yet most classical discrepancy measures are tailored to two-group comparisons. We generalize the underlap coefficient…
A new lifetime model, named the Modi linear failure rate distribution, is suggested. This flexible model is capable of accommodating a wide range of hazard rate shapes, including decreasing, increasing, bathtub, upside-down bathtub, and…
The problem of estimating the growth rate of a birth-and-death process based on the coalescence times of a sample of $n$ individuals has been considered by several authors (\cite{stadler2009incomplete, williams2022life,…
Community detection in multi-layer networks is a fundamental task in complex network analysis across various areas like social, biological, and computer sciences. However, most existing algorithms assume that the number of communities is…
Semi-parametric quantile regression (SPQR) is a flexible approach to density regression that learns a spline-based representation of conditional density functions using neural networks. As it makes no parametric assumptions about the…
We propose a dynamic multiplicative factor model for process data, which arise from complex problem-solving items, an emerging testing mode in large-scale educational assessment. The proposed model can be viewed as an extension of the…
Consider a group of individuals (subjects) participating in the same psychological tests with numerous questions (items) at different times, where the choices of each item have an implicit ordering. The observed responses can be recorded in…
In recent years, there has been substantial interest in the task of selective inference: inference on a parameter that is selected from the data. Many of the existing proposals fall into what we refer to as the \emph{infer-and-widen}…
Sparse and outlier-robust Principal Component Analysis (PCA) has been a very active field of research recently. Yet, most existing methods apply PCA to a single dataset whereas multi-source data, i.e., multiple related datasets requiring…
Matching is one of the most widely used causal inference designs in observational studies, but post-matching confounding bias remains a critical concern. This bias includes overt bias from inexact matching on measured confounders and hidden…
Multidimensional scaling (MDS) is widely used to reconstruct a low-dimensional representation of high-dimensional data while preserving pairwise distances. However, Bayesian MDS approaches based on Markov chain Monte Carlo (MCMC) face…
In multivariate longitudinal studies, associations between outcomes often exhibit time-varying and individual-level heterogeneity, motivating the modeling of correlations as an explicit function of time and covariates. However, most…
High-dimensional datasets are frequently subject to contamination by outliers and heavy-tailed noise, which can severely bias standard regularized estimators like the Lasso. While Maximum Mean Discrepancy (MMD) has recently been introduced…
Conditional independence tests (CIT) are widely used for causal discovery and feature selection. Even with false discovery rate (FDR) control procedures, they often fail to provide frequentist guarantees in practice. We highlight two common…
We study the use of exchangeable multi-task Gaussian processes (GPs) for causal inference in panel data, applying the framework to two settings: one with a single treated unit subject to a once-and-for-all treatment and another with…