Machine Learning - Scifaro

Kernel-based guarantees for nonlinear parametric models in Bayesian optimization

Modern Bayesian optimization and adaptive sampling methods increasingly rely on nonlinear parametric models, yet theoretical guarantees for such models under adaptive data collection remain limited. Existing analyses largely focus on…

Machine Learning · Statistics 2026-05-14 Rafael Oliveira

Generative Modeling of Approximately Periodic Time Series by a Posterior-Weighted Gaussian Process

Discrete automated processes in industrial and cyber-physical systems often exhibit a repetitive structure in which successive repetitions follow a common trajectory while differing in duration, amplitude, and fine-scale dynamics. Such…

Machine Learning · Statistics 2026-05-14 Elias Reich, Saverio Messineo, Stefan Huber

On Hallucinations in Inverse Problems: Fundamental Limits and Provable Assessment Methods

Artificial intelligence (AI) has transformed imaging inverse problems, from medical diagnostics to Earth observation. Yet deep neural networks can produce hallucinations, realistic-looking but incorrect details, undermining their…

Machine Learning · Statistics 2026-05-14 David Iagaru, Nina M. Gottschling, Anders C. Hansen, Josselin Garnier

Amortized Neural Clustering of Time Series based on Statistical Features

This paper introduces an algorithm-agnostic approach to feature-based time series clustering via amortized neural inference. By training neural networks to approximate the optimal partitioning rule from simulated data, the proposed…

Machine Learning · Statistics 2026-05-14 Ángel López-Oriona, Ying Sun

State-of-art minibatches via novel DPP kernels: discretization, wavelets, and rough objectives

Determinantal point processes (DPPs) have emerged as a kernelized alternative to vanilla independent sampling for generating efficient minibatches, coresets and other parsimonious representations of large-scale datasets. While theoretical…

Machine Learning · Statistics 2026-05-14 Hoang-Son Tran, Pranav Gupta, Rémi Bardenet, Subhroshekhar Ghosh

Adaptive Kernel Density Estimation with Pre-training

Density estimation in high-dimensional settings is an important and challenging statistical problem. Traditional methods based on kernel smoothing are inefficient in high dimensions due to the difficulties in specifying appropriate…

Machine Learning · Statistics 2026-05-14 Ruitong Zhang, Ke Deng

Coreset-Induced Conditional Velocity Flow Matching

We propose Coreset-Induced Conditional Velocity Flow Matching (CCVFM), a generative model that augments hierarchical rectified flow with a data-informed source distribution. Hierarchical flow matching models the full conditional velocity…

Machine Learning · Statistics 2026-05-14 Xiao Wang, Zihua She, Jianxi Su

When Should an AI Workflow Release? Always-Valid Inference for Black-Box Generate-Verify Systems

LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This…

Machine Learning · Statistics 2026-05-14 Young Hyun Cho, Will Wei Sun

The Mechanism of Weak-to-Strong Generalization: Feature Elicitation from Latent Knowledge

Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the…

Machine Learning · Statistics 2026-05-14 Ryoya Awano, Taiji Suzuki

Robust Sequential Experimental Design for A/B Testing

Experimental design has emerged as a powerful approach for improving the sample efficiency of A/B testing, yet existing designs rely critically on correctly specified models. We study robust sequential experimental design under model…

Machine Learning · Statistics 2026-05-14 Qianglin Wen, Xiangkun Wu, Chengchun Shi, Ting Li, Niansheng Tang, Yingying Zhang, Hongtu Zhu

ISOMORPH: A Supply Chain Digital Twin for Simulation, Dataset Generation, and Forecasting Benchmarks

Open time-series forecasting (TSF) benchmarks cover retail, energy, weather, and traffic, but supply-chain logistics remains underserved. We introduce ISOMORPH, the first public digital twin of a multi-echelon logistics network with fully…

Machine Learning · Statistics 2026-05-14 Zhizhen Zhang, Hyemin Gu, Benjamin J. Zhang, Daniel Elenius, Michael Tyrrell, Theo J. Bourdais, Houman Owhadi, Markos A. Katsoulakis, Tuhin Sahai

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$…

Machine Learning · Statistics 2026-05-14 Tomohiro Hayase, Ryo Karakida

Online Conformal Prediction: Enforcing monotonicity via Online Optimization

Conformal prediction provides a principled framework for uncertainty quantification with finite-sample coverage guarantees. While recent work has extended conformal prediction to online and sequential settings, existing methods typically…

Machine Learning · Statistics 2026-05-14 Eduardo Ochoa Rivera, Ambuj Tewari

Distribution Shift in Missing Data Imputation: A Risk-Based Perspective and Importance-Weighted Correction under MAR

Missing data imputation, where a model is trained on observed data to estimate unobserved values, is a fundamental problem in machine learning. In this paper, we rigorously formulate imputation model learning as a mean-squared error risk…

Machine Learning · Statistics 2026-05-14 Luke Shannon, Song Liu, Katarzyna Reluga

Diffusion Model's Generalization Can Be Characterized by Inductive Biases toward a Data-Dependent Ridge Manifold

We study a data-dependent notion of diffusion-model generalization: when a model does not memorize the training set, where do its generated samples go relative to the geometry induced by the data? To answer this, we introduce a…

Machine Learning · Statistics 2026-05-14 Ye He, Yitong Qiu, Molei Tao

Efficient Generative Prediction for EHR Foundation Models: The SCOPE and REACH Estimators

Generative foundation models trained on tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction via Monte Carlo sampling of simulated future trajectories. However, this approach suffers from three…

Machine Learning · Statistics 2026-05-14 Luke Solo, Matthew B. A. McDermott, William F. Parker, Bashar Ramadan, Michael C. Burkhart, Brett K. Beaulieu-Jones

Plug-In Classification of Drift Functions in Diffusion Processes Using Neural Networks

We study supervised multiclass classification for diffusion processes, where each class is characterized by a distinct drift function and trajectories are observed at discrete times. We first derive a multidimensional Bayes rule and then…

Machine Learning · Statistics 2026-05-14 Yuzhen Zhao, Jiarong Fan, Yating Liu

When to Transfer: Adaptive Source Selection for Positive Transfer in Linear Models

In many business settings, task-specific labeled data are scarce or costly to obtain, limiting supervised learning on a target task. A classical response is transfer learning (TL). Many TL works study how to transfer information from…

Machine Learning · Statistics 2026-05-14 Hamza Cherkaoui, Hélène Halconruy, Yohan Petetin

Sample-Efficient Optimisation over the Outputs of Generative Models

Modern generative AI models, such as diffusion and flow matching models, can sample from rich data distributions. However, many applications, especially in science and engineering, require more than drawing samples from the model…

Machine Learning · Statistics 2026-05-14 Samuel Willis, Paul Duckworth, Jack Simons, Aleksandra Kalisz, Krisztina Sinkovics, Noam Ghenassia, Shikha Surana, Henry T. Oldroyd, Alexandru I. Stere, Dragos D Margineantu, Carl Henrik Ek, Henry Moss, Erik Bodin

Geometric Autoencoder Priors for Bayesian Inversion: Learn First Observe Later

Uncertainty Quantification (UQ) is paramount for inference in engineering. A common inference task is to recover full-field information of physical systems from a small number of noisy observations, a usually highly ill-posed problem.…

Machine Learning · Statistics 2026-05-14 Arnaud Vadeboncoeur, Gregory Duthé, Mark Girolami, Eleni Chatzi