Machine Learning - Scifaro

Accounting for shared covariates in semi-parametric Bayesian additive regression trees

We propose some extensions to semi-parametric models based on Bayesian additive regression trees (BART). In the semi-parametric BART paradigm, the response variable is approximated by a linear predictor and a BART model, where the linear…

Machine Learning · Statistics 2026-03-10 Estevão B. Prado, Andrew C. Parnell, Keefe Murphy, Nathan McJames, Ann O'Shea, Rafael A. Moral

SPPCSO: Adaptive Penalized Estimation Method for High-Dimensional Correlated Data

With the rise of high-dimensional correlated data, multicollinearity poses a significant challenge to model stability, often leading to unstable estimation and reduced predictive accuracy. This work proposes the Single-Parametric Principal…

Machine Learning · Statistics 2026-03-09 Ying Hu, Hu Yang

Prediction-Powered Conditional Inference

We study prediction-powered conditional inference in the setting where labeled data are scarce, unlabeled covariates are abundant, and a black-box machine-learning predictor is available. The goal is to perform statistical inference on…

Machine Learning · Statistics 2026-03-09 Yang Sui, Jin Zhou, Hua Zhou, Xiaowu Dai

Learning Optimal Distributionally Robust Individualized Treatment Rules Integrating Multi-Source Data

Integrative analysis of multiple datasets for estimating optimal individualized treatment rules (ITRs) can enhance decision efficiency. A central challenge is posterior shift, wherein the conditional distribution of potential outcomes given…

Machine Learning · Statistics 2026-03-09 Wenhai Cui, Wen Su, Xingqiu Zhao

Escaping Model Collapse via Synthetic Data Verification: Near-term Improvements and Long-term Convergence

Synthetic data has been increasingly used to train frontier generative models. However, recent studies raise key concerns that iteratively retraining a generative model on its self-generated synthetic data may keep deteriorating model…

Machine Learning · Statistics 2026-03-09 Bingji Yi, Qiyuan Liu, Yuwei Cheng, Haifeng Xu

Self-Speculative Masked Diffusions

We present self-speculative masked diffusions, a new class of masked diffusion generative models for discrete data that require significantly fewer function evaluations to generate samples. Standard masked diffusion models predict…

Machine Learning · Statistics 2026-03-09 Andrew Campbell, Valentin De Bortoli, Jiaxin Shi, Arnaud Doucet

Spectral/Spatial Tensor Atomic Cluster Expansion with Universal Embeddings in Cartesian Space

Equivariant atomistic machine learning models have largely been built on spherical-tensor representations, where explicit angular-momentum coupling introduces substantial complexity and systematic extensions beyond energies and forces…

Machine Learning · Statistics 2026-03-09 Zemin Xu, Wenbo Xie, P. Hu

Kernel VICReg for Self-Supervised Learning in Reproducing Kernel Hilbert Space

Self-supervised learning (SSL) has emerged as a powerful paradigm for representation learning by optimizing geometric objectives, such as invariance to augmentations, variance preservation, and feature decorrelation, without requiring…

Machine Learning · Statistics 2026-03-09 M. Hadi Sepanj, Benyamin Ghojogh, Saed Moradi, Paul Fieguth

Thermodynamic Response Functions in Singular Bayesian Models

Singular statistical models, including mixtures, matrix factorization, and neural networks, violate regular asymptotics due to parameter non-identifiability and degenerate Fisher geometry. Although singular learning theory characterizes…

Machine Learning · Statistics 2026-03-06 Sean Plummer

Harnessing Synthetic Data from Generative AI for Statistical Inference

The emergence of generative AI models has dramatically expanded the availability and use of synthetic data across scientific, industrial, and policy domains. While these developments open new possibilities for data analysis, they also raise…

Machine Learning · Statistics 2026-03-06 Ahmad Abdel-Azim, Ruoyu Wang, Xihong Lin

How important are the genes to explain the outcome - the asymmetric Shapley value as an honest importance metric for high-dimensional features

In clinical prediction settings the importance of a high-dimensional feature like genomics is often assessed by evaluating the change in predictive performance when adding it to a set of traditional clinical variables. This approach is…

Machine Learning · Statistics 2026-03-06 Mark A. van de Wiel, Jeroen Goedhart, Martin Jullum, Kjersti Aas

Bayesian Supervised Causal Clustering

Finding patient subgroups with similar characteristics is crucial for personalized decision-making in various disciplines such as healthcare and policy evaluation. While most existing approaches rely on unsupervised clustering methods,…

Machine Learning · Statistics 2026-03-06 Luwei Wang, Nazir Lone, Sohan Seth

Learning Optimal Individualized Decision Rules with Conditional Demographic Parity

Individualized decision rules (IDRs) have become increasingly prevalent in societal applications such as personalized marketing, healthcare, and public policy design. However, a critical ethical concern arises from the potential…

Machine Learning · Statistics 2026-03-06 Wenhai Cui, Wen Su, Donglin Zeng, Xingqiu Zhao

How Does the ReLU Activation Affect the Implicit Bias of Gradient Descent on High-dimensional Neural Network Regression?

Overparameterized ML models, including neural networks, typically induce underdetermined training objectives with multiple global minima. The implicit bias refers to the limiting global minimum that is attained by a common optimization…

Machine Learning · Statistics 2026-03-06 Kuo-Wei Lai, Guanghui Wang, Molei Tao, Vidya Muthukumar

Optimal Prediction-Augmented Algorithms for Testing Independence of Distributions

Independence testing is a fundamental problem in statistical inference: given samples from a joint distribution $p$ over multiple random variables, the goal is to determine whether $p$ is a product distribution or is $\epsilon$-far from all…

Machine Learning · Statistics 2026-03-06 Maryam Aliakbarpour, Alireza Azizi, Ria Stevens

Bayesian Modeling of Collatz Stopping Times: A Probabilistic Machine Learning Perspective

We study the Collatz total stopping time $\tau(n)$ over $n\le 10^7$ from a probabilistic machine learning viewpoint. Empirically, $\tau(n)$ is a skewed and heavily overdispersed count with pronounced arithmetic heterogeneity. We develop two…

Machine Learning · Statistics 2026-03-06 Nicolò Bonacorsi, Matteo Bordoni

Dictionary Based Pattern Entropy for Causal Direction Discovery

Discovering causal direction from temporal observational data is particularly challenging for symbolic sequences, where functional models and noise assumptions are often unavailable. We propose a novel Dictionary Based Pattern Entropy…

Machine Learning · Statistics 2026-03-06 Harikrishnan N B, Shubham Bhilare, Aditi Kathpalia, Nithin Nagaraj

Symmetric Aggregation of Conformity Scores for Efficient Uncertainty Sets

Access to multiple predictive models trained for the same task, whether in regression or classification, is increasingly common in many applications. Aggregating their predictive uncertainties to produce reliable and efficient uncertainty…

Machine Learning · Statistics 2026-03-06 Nabil Alami, Jad Zakharia, Souhaib Ben Taieb

Testing Most Influential Sets

Small influential data subsets can dramatically impact model conclusions, with a few data points overturning key findings. While recent work identifies these most influential sets, there is no formal way to tell when maximum influence is…

Machine Learning · Statistics 2026-03-06 Lucas Darius Konrad, Nikolas Kuschnig

Quantitative convergence of trained single layer neural networks to Gaussian processes

In this paper, we study the quantitative convergence of shallow neural networks trained via gradient descent to their associated Gaussian processes in the infinite-width limit. While previous work has established qualitative convergence…

Machine Learning · Statistics 2026-03-06 Eloy Mosig, Andrea Agazzi, Dario Trevisan