Machine Learning - Scifaro

A Connection Between Learning to Reject and Bhattacharyya Divergences

Learning to reject provides a learning paradigm which allows our models to abstain from making predictions. One way to learn the rejector is to learn an ideal marginal distribution (w.r.t. the input domain) - which characterizes a…

Machine Learning · Statistics 2025-05-09 Alexander Soen

A Two-Sample Test of Text Generation Similarity

The surge in digitized text data requires reliable inferential methods on observed textual patterns. This article proposes a novel two-sample text test for comparing similarity between two groups of documents. The hypothesis is whether the…

Machine Learning · Statistics 2025-05-09 Jingbin Xu, Chen Qian, Meimei Liu, Feng Guo

Boosting Statistic Learning with Synthetic Data from Pretrained Large Models

The rapid advancement of generative models, such as Stable Diffusion, raises a key question: how can synthetic data from these models enhance predictive modeling? While they can generate vast amounts of datasets, only a subset meaningfully…

Machine Learning · Statistics 2025-05-09 Jialong Jiang, Wenkang Hu, Jian Huang, Yuling Jiao, Xu Liu

Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach

Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the…

Machine Learning · Statistics 2025-05-09 Qian Peng, Yajie Bao, Haojie Ren, Zhaojun Wang, Changliang Zou

Learning Linearized Models from Nonlinear Systems under Initialization Constraints with Finite Data

The identification of a linear system model from data has wide applications in control theory. The existing work that provides finite sample guarantees for linear system identification typically uses data from a single long system…

Machine Learning · Statistics 2025-05-09 Lei Xin, Baike She, Qi Dou, George Chiu, Shreyas Sundaram

Physics-Informed Sylvester Normalizing Flows for Bayesian Inference in Magnetic Resonance Spectroscopy

Magnetic resonance spectroscopy (MRS) is a non-invasive technique to measure the metabolic composition of tissues, offering valuable insights into neurological disorders, tumor detection, and other metabolic dysfunctions. However, accurate…

Machine Learning · Statistics 2025-05-09 Julian P. Merkofer, Dennis M. J. van de Sande, Alex A. Bhogal, Ruud J. G. van Sloun

Rejection via Learning Density Ratios

Classification with rejection emerges as a learning paradigm which allows models to abstain from making predictions. The predominant approach is to alter the supervised learning pipeline by augmenting typical loss functions, letting model…

Machine Learning · Statistics 2025-05-09 Alexander Soen, Hisham Husain, Philip Schulz, Vu Nguyen

A Tutorial on Discriminative Clustering and Mutual Information

To cluster data is to separate samples into distinctive groups that should ideally have some cohesive properties. Today, numerous clustering algorithms exist, and their differences lie essentially in what can be perceived as ``cohesive…

Machine Learning · Statistics 2025-05-08 Louis Ohl, Pierre-Alexandre Mattei, Frédéric Precioso

Categorical and geometric methods in statistical, manifold, and machine learning

We present and discuss applications of the category of probabilistic morphisms, initially developed in \cite{Le2023}, as well as some geometric methods to several classes of problems in statistical, machine and manifold learning which shall…

Machine Learning · Statistics 2025-05-08 Hông Vân Lê, Hà Quang Minh, Frederic Protin, Wilderich Tuschmann

Cer-Eval: Certifiable and Cost-Efficient Evaluation Framework for LLMs

As foundation models continue to scale, the size of trained models grows exponentially, presenting significant challenges for their evaluation. Current evaluation practices involve curating increasingly large datasets to assess the…

Machine Learning · Statistics 2025-05-08 Ganghua Wang, Zhaorun Chen, Bo Li, Haifeng Xu

Multi-modal cascade feature transfer for polymer property prediction

In this paper, we propose a novel transfer learning approach called multi-modal cascade model with feature transfer for polymer property prediction. Polymers are characterized by a composite of data in several different formats, including…

Machine Learning · Statistics 2025-05-08 Kiichi Obuchi, Yuta Yahagi, Kiyohiko Toyama, Shukichi Tanaka, Kota Matsui

Ranked differences Pearson correlation dissimilarity with an application to electricity users time series clustering

Time series clustering is an unsupervised learning method for classifying time series data into groups with similar behavior. It is used in applications such as healthcare, finance, economics, energy, and climate science. Several time…

Machine Learning · Statistics 2025-05-08 Chutiphan Charoensuk, Nathakhun Wiroonsri

On Understanding Attention-Based In-Context Learning for Categorical Data

In-context learning based on attention models is examined for data with categorical outcomes, with inference in such models viewed from the perspective of functional gradient descent (GD). We develop a network composed of attention blocks,…

Machine Learning · Statistics 2025-05-08 Aaron T. Wang, William Convertino, Xiang Cheng, Ricardo Henao, Lawrence Carin

Thermodynamic limit in learning period three

A continuous one-dimensional map with period three includes all periods. This raises the following question: Can we obtain any type of periodic orbit solely by learning three data points? In this paper, we report that the answer is yes.…

Machine Learning · Statistics 2025-05-08 Yuichiro Terasaki, Kohei Nakajima

Transport meets Variational Inference: Controlled Monte Carlo Diffusions

Connecting optimal transport and variational inference, we present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of the…

Machine Learning · Statistics 2025-05-08 Francisco Vargas, Shreyas Padhy, Denis Blessing, Nikolas Nüsken

Tight Regret Bounds for Bayesian Optimization in One Dimension

We consider the problem of Bayesian optimization (BO) in one dimension, under a Gaussian process prior and Gaussian sampling noise. We provide a theoretical analysis showing that, under fairly mild technical assumptions on the kernel, the…

Machine Learning · Statistics 2025-05-08 Jonathan Scarlett

Actor-Critics Can Achieve Optimal Sample Efficiency

Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work…

Machine Learning · Statistics 2025-05-07 Kevin Tan, Wei Fan, Yuting Wei

Decision Making under Model Misspecification: DRO with Robust Bayesian Ambiguity Sets

Distributionally Robust Optimisation (DRO) protects risk-averse decision-makers by considering the worst-case risk within an ambiguity set of distributions based on the empirical distribution or a model. To further guard against finite,…

Machine Learning · Statistics 2025-05-07 Charita Dellaporta, Patrick O'Hara, Theodoros Damoulas

Lower Bounds for Greedy Teaching Set Constructions

A fundamental open problem in learning theory is to characterize the best-case teaching dimension $\operatorname{TS}_{\min}$ of a concept class $\mathcal{C}$ with finite VC dimension $d$. Resolving this problem will, in particular, settle…

Machine Learning · Statistics 2025-05-07 Spencer Compton, Chirag Pabbaraju, Nikita Zhivotovskiy

A Symbolic and Statistical Learning Framework to Discover Bioprocessing Regulatory Mechanism: Cell Culture Example

Bioprocess mechanistic modeling is essential for advancing intelligent digital twin representation of biomanufacturing, yet challenges persist due to complex intracellular regulation, stochastic system behavior, and limited experimental…

Machine Learning · Statistics 2025-05-07 Keilung Choy, Wei Xie, Keqi Wang