Machine Learning - Scifaro

Wasserstein multivariate auto-regressive models for modeling distributional time series

This paper is focused on the statistical analysis of data consisting of a collection of multiple series of probability measures that are indexed by distinct time instants and supported over a bounded interval of the real line. By modeling…

Machine Learning · Statistics 2026-05-05 Yiye Jiang, Jérémie Bigot

Online Graph Topology Learning from Matrix-valued Time Series

The focus is on the statistical analysis of matrix-valued time series, where data is collected over a network of sensors, typically at spatial locations, over time. Each sensor records a vector of features at each time point, creating a…

Machine Learning · Statistics 2026-05-05 Yiye Jiang, Jérémie Bigot, Sofian Maabout

Decentralized Proximal Stochastic Gradient Langevin Dynamics

We propose Decentralized Proximal Stochastic Gradient Langevin Dynamics (DE-PSGLD), a decentralized Markov chain Monte Carlo (MCMC) algorithm for sampling from a log-concave probability distribution constrained to a convex domain.…

Machine Learning · Statistics 2026-05-04 Mohammad Rafiqul Islam, Lingjiong Zhu

Adaptive Querying with AI Persona Priors

We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight question budgets. Classical Bayesian design and computerized adaptive testing…

Machine Learning · Statistics 2026-05-04 Kaizheng Wang, Yuhang Wu, Assaf Zeevi

Gradient Regularized Newton Boosting Trees with Global Convergence

Gradient Boosting Decision Trees (GBDTs) dominate tabular machine learning, with modern implementations like XGBoost, LightGBM, and CatBoost being based on Newton boosting: a second-order descent step in the space of decision trees. Despite…

Machine Learning · Statistics 2026-05-04 Nikita Zozoulenko, Daniel Falkowski, Thomas Cass, Lukas Gonon

Information-geometric adaptive sampling for graph diffusion

Standard diffusion models for graph generation typically rely on uniform time-stepping, an approach that overlooks the non-homogeneous dynamics of distributional evolution on complex manifolds. In this paper, we present an…

Machine Learning · Statistics 2026-05-04 Yuhui Lu, Wenjing Liu, Kun Zhan

A unified perspective on fine-tuning and sampling with diffusion and flow models

We study the problem of training diffusion and flow generative models to sample from target distributions defined by an exponential tilting of a base density, a formulation that subsumes both sampling from unnormalized densities and reward…

Machine Learning · Statistics 2026-05-04 Carles Domingo-Enrich, Yuanqi Du, Michael S. Albergo

SHIFT: Robust Double Machine Learning for Average Dose-Response Functions under Heavy-Tailed Contamination

Double-machine-learning pipelines for the Average Dose-Response Function rely on kernel-weighted local-linear smoothers, which inherit unbounded functional influence: a single outlier within a kernel window biases the curve across the…

Machine Learning · Statistics 2026-05-04 Eichi Uehara

Adaptive Norm-Based Regularization for Neural Networks

In this paper, we study norm-based regularization methods for neural networks. We compare existing penalization approaches and introduce two regularization strategies that extend classical ridge- and lasso-type penalties to neural network…

Machine Learning · Statistics 2026-05-04 Muhammad Qasim, Farrukh Javed

Generative Modeling under Non-Monotone MAR Missingness via Approximate Wasserstein Gradient Flows

The prevalence of missing values in data science poses a substantial risk to any further analyses. Despite a wealth of research, principled nonparametric methods to deal with general non-monotone missingness are still scarce. Instead,…

Machine Learning · Statistics 2026-05-04 Gitte Kremling, Jeffrey Näf, Johannes Lederer

Statistical Testing Framework for Clustering Pipelines by Selective Inference

A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating multiple analysis algorithms. In many practical applications, analytical findings are obtained only after data pass…

Machine Learning · Statistics 2026-05-04 Yugo Miyata, Tomohiro Shiraishi, Shuichi Nishino, Ichiro Takeuchi

Minimizing Human Intervention in Online Classification

Training or fine-tuning large language model (LLM)-based systems often requires costly human feedback, yet there is limited understanding of how to minimize such intervention while maintaining strong error guarantees. We study this problem…

Machine Learning · Statistics 2026-05-04 William Réveillard, Vasileios Saketos, Alexandre Proutiere, Richard Combes

Doubly robust identification of treatment effects from multiple environments

Practical and ethical constraints often require the use of observational data for causal inference, particularly in medicine and social sciences. Yet, observational datasets are prone to confounding, potentially compromising the validity of…

Machine Learning · Statistics 2026-05-04 Piersilvio De Bartolomeis, Julia Kostin, Javier Abad, Yixin Wang, Fanny Yang

Prediction-powered Inference by Mixture of Experts

The rapidly expanding artificial intelligence (AI) industry has produced diverse yet powerful prediction tools, each with its own network architecture, training strategy, data-processing pipeline, and domain-specific strengths. These tools…

Machine Learning · Statistics 2026-05-01 Yanwu Gu, Linglong Kong, Dong Xia

Bayesian X-Learner: Calibrated Posterior Inference for Heterogeneous Treatment Effects under Heavy-Tailed Outcomes

Conditional Average Treatment Effect (CATE) estimation in practice demands three properties simultaneously: heterogeneous effects $\tau(x)$, calibrated uncertainty over them, and robustness to the heavy tails that contaminate real outcome…

Machine Learning · Statistics 2026-05-01 Eichi Uehara

SCOPE-FE: Structured Control of Operator and Pairwise Exploration for Feature Engineering

Automatic feature engineering is an effective approach for improving predictive performance in tabular learning. However, expand-and-reduce methods, such as OpenFE, become increasingly computationally expensive as the input dimensionality…

Machine Learning · Statistics 2026-05-01 Minhee Park, Seongyeon Son, Yonghyun Lee, Eunchan Kim

DDO-RM: Distribution-Level Policy Improvement after Reward Learning

Recent theory suggests that reward-model-first methods can be more sample-efficient than direct policy fitting when the reward function is statistically simpler than the induced policy. We propose DDO-RM, a finite-candidate…

Machine Learning · Statistics 2026-05-01 Tiantian Zhang, Jierui Zuo, Michael Chen, Wenping Wang

EmDT: Embedding Diffusion Transformer for Tabular Data Generation in Fraud Detection

Imbalanced datasets pose a difficulty in fraud detection, as classifiers are often biased toward the majority class and perform poorly on rare fraudulent transactions. Synthetic data generation is therefore commonly used to mitigate this…

Machine Learning · Statistics 2026-05-01 En-Ya Kuo, Sebastien Motsch

Bayesian Hierarchical Models and the Maximum Entropy Principle

Bayesian hierarchical models are frequently used in practical data analysis contexts. One interpretation of these models is that they provide an indirect way of assigning a prior for unknown parameters, through the introduction of…

Machine Learning · Statistics 2026-05-01 Brendon J. Brewer

Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

This paper studies the high-dimensional scaling limits of online stochastic gradient descent (SGD). Building on the recent work of Ben Arous, Gheissari, and Jagannath on the effective dynamics of SGD, we study the critical scaling regime of…

Machine Learning · Statistics 2026-05-01 Parsa Rangriz