Statistics
Efficient benchmarking techniques aim to lower the computational cost of evaluating LLMs by predicting full benchmark scores using only a subset of a benchmark's questions. By reframing this problem as an instance of multiple regression…
In multi-modal biomedical research, integrating high-dimensional genomic data with clinical baselines is essential for precision medicine. However, standard deep neural network approaches often entangle these modalities, obscuring the…
This paper proposes StrTransformer, a source-wise structured Transformer framework for blind source recovery and branch-wise latent modeling. Instead of using an encoder to infer latent variables, StrTransformer directly optimizes the…
The ability of deep neural networks to learn hierarchical features is widely regarded as a key mechanism underlying their success in high-dimensional learning. Existing theory partially supports this view by establishing approximation rates…
We study optimal experimental design for multinomial logit (MNL) bandits, where an agent repeatedly selects a subset of $K$ items from a ground set of size $N$ and observes single-choice feedback. Unlike linear or generalized linear…
We study nonstationary generalized linear bandits (GLBs), where the expected reward is modeled through a nonlinear link function with an unknown time-varying parameter. This framework encompasses a broad class of reward models, including…
We study the geometry of determinantal point processes (DPPs) through the spectral decomposition $L=U\Lambda U^{\top}$. The spectrum $\Lambda$ governs the cardinality distribution via elementary symmetric polynomials, while the eigenspace…
Reconstructing PDE solutions from sparse observations is a core challenge in scientific computing. We present FM4PDE, a flow-matching generative framework that learns the joint distribution of PDE coefficients (or initial states) and…
Directed acyclic graphs provide a fundamental tool for representing directed dependence structures in multivariate network data, and are widely used to model financial and economic networks. However, accurate and interpretable estimation…
The use of ordinal patterns (OPs) for analyzing the dependence structure of univariate and continuously distributed processes has gained popularity in recent years. This research goes one step further and considers the transcripts being…
Removing noise is difficult, but adding noise is easy. In this work, we show how to eliminate mean-shift noisy components from PCA by deliberately introducing knockoff mean-shift perturbation. Standard PCA is highly sensitive to shifts in…
Graph Neural Networks (GNN) are currently the most popular approach for learning and prediction on graph-structured data and are deployed in various fields, from social network analysis to drug discovery. However, there is limited…
We consider graph diffusion processes constructed from finite i.i.d. samples drawn from an unknown manifold embedded in ambient Euclidean space, where the graph affinity is defined by an ambient Gaussian kernel matrix. We show that the…
We consider the problem of testing mutual independence among the components of a high-dimensional random vector. Building on the rank-based max-sum framework, we introduce fixed finite-$L_q$ power-sum statistics under three general classes…
Online experiments in ads, recommendation, and member-experience systems are often planned before the dominant interference mechanism is known. A treatment may propagate through budgets, inventory, producer exposure, graph spillovers, or…
We analyze a fixed panel of S\&P 500 stocks from 1996 to 2026 using complementary static and kinetic Ising models applied to daily binary open-to-close movements. The static pairwise model provides a long-run maximum-entropy summary of…
Kernel Stein discrepancy (KSD) is among the most popular goodness-of-fit (GoF) measures on general domains with a large number of successful deployments. One of the main applications of KSD is in constructing powerful GoF tests. However,…
This article is the rejoinder to ``The ICML 2023 Ranking Experiment: Examining Author Self-Assessment in ML/AI Peer Review,'' to appear in the Journal of the American Statistical Association with discussion. To address the practical and…
Reinforcement learning algorithms are generally designed to maximize the expected return across a population. However, a policy that is optimal on average may be suboptimal for certain individuals, leading to potential safety concerns. To…
Integrating multimodal datasets in clinical oncology is frequently hindered by high dimensionality and blockwise missingness, where entire data sources are unavailable for specific patient subsets. Standard survival models often struggle…