Machine Learning - Scifaro

Robust estimation of heterogeneous treatment effects in randomized trials leveraging external data

Randomized trials are typically designed to detect average treatment effects but often lack the statistical power to uncover individual-level treatment effect heterogeneity, limiting their value for personalized decision-making. To address…

Machine Learning · Statistics 2026-03-19 Rickard Karlsson, Piersilvio De Bartolomeis, Issa J. Dahabreh, Jesse H. Krijthe

Statistical Inference for Online Algorithms

The construction of confidence intervals and hypothesis tests for functionals is a cornerstone of statistical inference. Traditionally, the most efficient procedures - such as the Wald interval or the Likelihood Ratio Test - require both a…

Machine Learning · Statistics 2026-03-19 Selina Carter, Arun K Kuchibhotla

Minimum Volume Conformal Sets for Multivariate Regression

Conformal prediction provides a principled framework for constructing predictive sets with finite-sample validity. While much of the focus has been on univariate response variables, existing multivariate methods either impose rigid…

Machine Learning · Statistics 2026-03-19 Sacha Braun, Liviu Aolaritei, Michael I. Jordan, Francis Bach

How PC-based Methods Err: Towards Better Reporting of Assumption Violations and Small Sample Errors

Causal discovery methods based on the PC algorithm are proven to be sound if all structural assumptions are fulfilled and all conditional independence tests are correct. This idealized setting is rarely given in real data. In this work, we…

Machine Learning · Statistics 2026-03-19 Sofia Faltenbacher, Jonas Wahl, Rebecca Herman, Jakob Runge

Conditional Distributional Treatment Effects: Doubly Robust Estimation and Testing

Beyond conditional average treatment effects, treatments may impact the entire outcome distribution in covariate-dependent ways, for example, by altering the variance or tail risks for specific subpopulations. We propose a novel estimand to…

Machine Learning · Statistics 2026-03-18 Saksham Jain, Alex Luedtke

Safe Distributionally Robust Feature Selection under Covariate Shift

In practical machine learning, the environments encountered during the model development and deployment phases often differ, especially when a model is used by many users in diverse settings. Learning models that maintain reliable…

Machine Learning · Statistics 2026-03-18 Hiroyuki Hanada, Satoshi Akahane, Noriaki Hashimoto, Shion Takeno, Ichiro Takeuchi

Learning to Recall with Transformers Beyond Orthogonal Embeddings

Modern large language models (LLMs) excel at tasks that require storing and retrieving knowledge, such as factual recall and question answering. Transformers are central to this capability because they can encode information during training…

Machine Learning · Statistics 2026-03-18 Nuri Mert Vural, Alberto Bietti, Mahdi Soltanolkotabi, Denny Wu

Beyond Distance: Quantifying Point Cloud Dynamics with Persistent Homology and Dynamic Optimal Transport

We introduce a framework for analyzing topological tipping in time-evolutionary point clouds by extending the recently proposed Topological Optimal Transport (TpOT) distance. While TpOT unifies geometric, homological, and higher-order…

Machine Learning · Statistics 2026-03-18 Yixin Wang, Ting Gao, Jinqiao Duan

Analyzing Error Sources in Global Feature Effect Estimation

Global feature effects such as partial dependence (PD) and accumulated local effects (ALE) plots are widely used to interpret black-box models. However, they are only estimates of true underlying effects, and their reliability depends on…

Machine Learning · Statistics 2026-03-18 Timo Heiß, Coco Bögel, Bernd Bischl, Giuseppe Casalicchio

Near-Optimal Clustering in Mixture of Markov Chains

We study the problem of clustering $T$ trajectories of length $H$, each generated by one of $K$ unknown ergodic Markov chains over a finite state space of size $S$. We derive an instance-dependent, high-probability lower bound on the…

Machine Learning · Statistics 2026-03-18 Junghyun Lee, Yassir Jedra, Alexandre Proutière, Se-Young Yun

ALPCAH: Subspace Learning for Sample-wise Heteroscedastic Data

Principal component analysis (PCA) is a key tool in the field of data dimensionality reduction. However, some applications involve heterogeneous data that vary in quality due to noise characteristics associated with each data sample.…

Machine Learning · Statistics 2026-03-18 Javier Salazar Cavazos, Jeffrey A. Fessler, Laura Balzano

On the minimax optimality of Flow Matching through the connection to kernel density estimation

Flow Matching has recently gained attention in generative modeling as a simple and flexible alternative to diffusion models. While existing statistical guarantees adapt tools from the analysis of diffusion models, we take a different…

Machine Learning · Statistics 2026-03-18 Lea Kunkel, Mathias Trabs

Inference for Deep Neural Network Estimators in Generalized Nonparametric Models

While deep neural networks (DNNs) are used for prediction, inference on DNN-estimated subject-specific means for categorical or exponential family outcomes remains underexplored. We address this by proposing a DNN estimator under…

Machine Learning · Statistics 2026-03-18 Xuran Meng, Yi Li

Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Cells

Motivation: Networks underlie the generation and interpretation of many biological datasets: gene networks shed light on the regulatory structure of the genome, and cell networks can capture structure of the tumor micro-environment.…

Machine Learning · Statistics 2026-03-18 Bailey Andrew, Erica L. Harris, James A. Poulter, David R. Westhead, Luisa Cutillo

Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we derive several non-asymptotic upper bounds on…

Machine Learning · Statistics 2026-03-18 Abdoulaye Sakho, Emmanuel Malherbe, Erwan Scornet

Estimating Staged Event Tree Models via Hierarchical Clustering on the Simplex

Staged tree models enhance Bayesian networks by incorporating context-specific dependencies through a stage-based structure. In this study, we present a new framework for estimating staged trees using hierarchical clustering on the…

Machine Learning · Statistics 2026-03-17 Muhammad Shoaib, Eva Riccomagno, Manuele Leonelli, Gherardo Varando

Persistence Spheres: a Bi-continuous Linear Representation of Measures for Partial Optimal Transport

We improve and extend persistence spheres, introduced in \cite{pegoraro2025persistence}. Persistence spheres map an integrable measure $\mu$ on the upper half-plane, including persistence diagrams (PDs) as counting measures, to a function…

Machine Learning · Statistics 2026-03-17 Matteo Pegoraro

Active Seriation: Efficient Ordering Recovery with Statistical Guarantees

Active seriation aims at recovering an unknown ordering of $n$ items by adaptively querying pairwise similarities. The observations are noisy measurements of entries of an underlying $n \times n$ permuted Robinson matrix, whose permutation…

Machine Learning · Statistics 2026-03-17 James Cheshire, Yann Issartel

Scalable Simulation-Based Model Inference with Test-Time Complexity Control

Simulation plays a central role in scientific discovery. In many applications, the bottleneck is no longer running a simulator; it is choosing among large families of plausible simulators, each corresponding to different forward…

Machine Learning · Statistics 2026-03-17 Manuel Gloeckler, J. P. Manzano-Patrón, Stamatios N. Sotiropoulos, Cornelius Schröder, Jakob H. Macke

The Sampling Complexity of Condorcet Winner Identification in Dueling Bandits

We study best-arm identification in stochastic dueling bandits under the sole assumption that a Condorcet winner exists, i.e., an arm that wins each noisy pairwise comparison with probability at least $1/2$. We introduce a new…

Machine Learning · Statistics 2026-03-17 El Mehdi Saad, Victor Thuot, Nicolas Verzelen