Machine Learning - Scifaro

On The Hidden Biases of Flow Matching Samplers

Flow matching (FM) constructs continuous-time ODE samplers by prescribing probability paths between a base distribution and a target distribution. In this note, we study FM through the lens of finite-sample plug-in estimation. In addition…

Machine Learning · Statistics 2026-05-15 Soon Hoe Lim

A New Framework for Convex Clustering in Kernel Spaces: Finite Sample Bounds, Consistency and Performance Insights

Convex clustering is a well-regarded clustering method that resembles the centroid-based approach of Lloyd's $k$-means but does not require a predefined cluster count. It starts with each data point as its centroid and iteratively merges…

Machine Learning · Statistics 2026-05-15 Shubhayan Pan, Kushal Bose, Debolina Paul, Saptarshi Chakraborty, Swagatam Das

Generative Bayesian Optimization: Generative Models as Acquisition Functions

We present a general strategy for turning generative models into candidate solution samplers for batch Bayesian optimization (BO). The use of generative models for BO enables large batch scaling as generative sampling, optimization of…

Machine Learning · Statistics 2026-05-15 Rafael Oliveira, Daniel M. Steinberg, Edwin V. Bonilla

Manifold Dimension Estimation via Local Graph Structure

Most existing manifold dimension estimators rely on the assumption that the underlying manifold is locally flat within the neighborhoods under consideration. More recently, curvature-adjusted principal component analysis (CA-PCA) has…

Machine Learning · Statistics 2026-05-15 Zelong Bi, Pierre Lafaye de Micheaux

On the Identifiability of Causal Graphs with the Invariance Principle

Causal discovery from i.i.d. observational data is known to be generally ill-posed. We demonstrate that if we have access to the distribution induced by a structural causal model, and additional data from (in the best case) only…

Machine Learning · Statistics 2026-05-15 Francesco Montagna

Stochastic dynamics learning with state-space systems

This work advances the theoretical foundations of reservoir computing (RC) by providing a unified treatment of fading memory and the echo state property (ESP) in both deterministic and stochastic settings. We investigate state-space…

Machine Learning · Statistics 2026-05-15 Juan-Pablo Ortega, Florian Rossmannek

Scalable Subset Selection in Linear Mixed Models

Linear mixed models (LMMs), which incorporate fixed and random effects, are key tools for analyzing heterogeneous data, such as in personalized medicine. Nowadays, this type of data is increasingly wide, sometimes containing thousands of…

Machine Learning · Statistics 2026-05-15 Ryan Thompson, Matt P. Wand, Joanna J. J. Wang

A Unified Framework for Entropy Search and Expected Improvement in Bayesian Optimization

Bayesian optimization is a widely used method for optimizing expensive black-box functions, with Expected Improvement being one of the most commonly used acquisition functions. In contrast, information-theoretic acquisition functions aim to…

Machine Learning · Statistics 2026-05-15 Nuojin Cheng, Leonard Papenmeier, Stephen Becker, Luigi Nardi

How well behaved is finite dimensional Diffusion Maps?

Under a set of assumptions on a family of submanifolds $\subset {\mathbb R}^D$, we derive a series of geometric properties that remain valid after finite-dimensional and almost isometric Diffusion Maps (DM), including almost uniform…

Machine Learning · Statistics 2026-05-15 Wenyu Bo, Marina Meilă

Distributional Principal Autoencoders

Dimension reduction techniques usually lose information in the sense that reconstructed data are not identical to the original data. However, we argue that it is possible to have reconstructed data identically distributed as the original…

Machine Learning · Statistics 2026-05-15 Xinwei Shen, Nicolai Meinshausen

Change of measure through the Legendre transform

PAC-Bayes generalisation bounds are derived via change-of-measure inequalities that transfer concentration properties from a reference measure to all posterior measures. The specific choice of change of measure determines the assumptions…

Machine Learning · Statistics 2026-05-15 Antoine Picard-Weibel, Benjamin Guedj

Robust and Sparse Regression in GLM by Stochastic Optimization

The generalized linear model (GLM) plays a key role in regression analyses. In high-dimensional data, the sparse GLM has been used but it is not robust against outliers. Recently, the robust methods have been proposed for the specific…

Machine Learning · Statistics 2026-05-15 Takayuki Kawashima, Hironori Fujisawa

What is Learnable in Valiant's Theory of the Learnable?

Valiant's 1984 paper is widely credited with introducing the PAC learning model, but it, in fact, introduced a different model: unlike PAC learning, the learner receives only positives, may issue membership queries, and must output a…

Machine Learning · Statistics 2026-05-14 Steve Hanneke, Anay Mehrotra, Grigoris Velegkas, Manolis Zampetakis

Conformal Anomaly Detection in Python: Moving Beyond Heuristic Thresholds with 'nonconform'

Most anomaly detection systems output scores rather than calibrated decisions, leaving practitioners to choose thresholds heuristically and without clear statistical interpretation. Conformal anomaly detection addresses this limitation by…

Machine Learning · Statistics 2026-05-14 Oliver Hennhöfer, Maximilian Kirsch, Christine Preisach

Causal Learning with the Invariance Principle

Causal discovery, the problem of inferring the direction of causality, is generally ill-posed. We use the language of structural causal models (SCM) to show that assuming that the causal relations are acyclic and invariant across multiple…

Machine Learning · Statistics 2026-05-14 Francesco Montagna, Francesco Locatello

On the Limits of Latent Reuse in Diffusion Models

Diffusion models are often trained in low-dimensional latent spaces, which are then reused for related but shifted datasets. In this work, we study when such latent reuse remains reliable under distribution shift. We consider a…

Machine Learning · Statistics 2026-05-14 Yifeng Yu, Lu Yu

Learning Perturbations to Extrapolate Your LLM

Recent advancements in large language models demonstrate that injecting perturbations can substantially enhance extrapolation performance. However, current approaches often rely on discrete perturbations with fixed designs, which limits…

Machine Learning · Statistics 2026-05-14 Zetai Cen, Chenfei Gu, Jin Zhu, Ting Li, Yunxiao Chen, Chengchun Shi

The Sample Complexity of Multiple Change Point Identification under Bandit Feedback

We study multiple change point localization under bandit feedback. An unknown piecewise-constant function on a compact interval can be queried sequentially at adaptively chosen inputs, and each query returns a noisy evaluation of the…

Machine Learning · Statistics 2026-05-14 Maximilian Graf, Victor Thuot

LLMs as Implicit Imputers: Uncertainty Should Scale with Missing Information

Large language models (LLMs) are increasingly deployed in settings where the available context is incomplete or degraded. We argue that an LLM generating answers under incomplete context can be viewed as an implicit imputer, and evaluated…

Machine Learning · Statistics 2026-05-14 Stef van Buuren

Coupling-Informed Transport Maps for Bayesian Filtering in Nonlinear Dynamical Systems

A likelihood-free transport filtering method is proposed based on the couplings between state and observation variables. By exploiting a block-triangular structure in the transport map, the analysis step of filtering is reformulated as the…

Machine Learning · Statistics 2026-05-14 Dengfei Zeng, Lijian Jiang, Shuyu Sun, Dunhui Xiao