Statistical Methodology
Label noise presents a fundamental challenge in modern machine learning, especially when large-scale datasets are generated via automated processes. An increasingly common and important data paradigm, particularly in domains like medical…
The availability of data from multiple heterogeneous environments has motivated methods that remain reliable under distributional shifts. When the joint distribution of response and predictors varies across environments, the response may…
We test the hypothesis that simultaneous linear contrasts of multiple variance components equal zero in a Gaussian variance components model via a parametric bootstrap. Applications include but are not limited to nested and crossed…
We study the problem of estimating locations in time at which the level of technology in an economy changes when given a sequence of time ordered inputs and outputs. We approach the problem through the lens of nonparametric frontier…
Tam [2026] shows that combining Bethel multivariate allocation with Hierarchical Bayes (HB) small area models can substantially reduce survey sample sizes while maintaining domain-level precision and near-nominal coverage of posterior…
Variable fusion in linear regression models is a statistical method that identifies covariates making similar contributions to the response variable and imposes the same coefficient values on them. Many methods for variable fusion also…
Dynamic treatment regimes (DTRs) are sequences of decision rules to guide treatment assignments in response to a patient's evolving, time-varying disease status. Sequential multiple assignment randomized trials (SMARTs) are considered the…
Local Polynomial Regression (LPR) is a powerful tool for nonparametric smoothing, yet it traditionally suffers from a "Euclidean tautology": the variables used to define the local neighborhood are identical to those used in the polynomial…
Functional autoregressive models of order one (FAR(1)) are predominantly estimated by projecting curves onto leading functional principal components and fitting a vector autoregression in score space, requiring a discrete truncation level…
We study split-conformal prediction for regression when the reported prediction set must be a single interval, at target marginal coverage $1-\alpha$, where $\alpha$ is the nominal miscoverage level. Under this reporting constraint, the…
Dynamic multilayer networks arise in many applications where multiple types of relations among a common set of nodes evolve over time. Existing approaches often assume temporal independence, focus on single-layer networks or impose…
Fractionally supervised classification (FSC) offers a flexible framework for combining labeled and unlabeled data in model-based classification, but existing formulations assume simple random sampling. In many applications, however, the…
Whether or not a country is at war, or experiencing escalating or de-escalating levels of conflict, has massive ramifications for a country's national and foreign policy. Given a country's history of conflict, or lack thereof, future…
We propose a density-valued vector autoregressive model with latent factors for multivariate time series of density functions. Motivated by weekly regional distributions of SARS-CoV-2 cycle threshold (Ct) values in Brazil, we study their…
A master protocol trial uses a single overarching protocol to test multiple therapies, often across several diseases or subtypes. Although such trials offer considerable flexibility and efficiency, their constrained and non-uniform…
This paper introduces a rectified and renormalized Fisher-Bingham model for compositional data with zeros, motivated in part by the presence of zeros in microbiota studies. The approach represents compositions through a square-root…
Estimating causal effects from high-dimensional, structured exposures is a fundamental challenge in modern applications ranging from neuroscience and finance to environmental science. While the literature has addressed high-dimensional…
Hyperbolic space is increasingly used for hierarchical, tree-like, and network-structured data, but likelihood-based density modeling on hyperbolic space remains relatively limited. This paper develops finite mixture modeling with isotropic…
The increased use of differential privacy (DP) has allowed the sharing of large amounts of data while reducing the risk of disclosure of sensitive information at the individual level. However, the noise introduced by DP methods makes…
Mendelian randomization is a powerful tool for causal inference in observational studies. The two-sample summary-data design, which estimates genetic associations with exposures and outcomes in separate cohorts, is the most widely used…