机器学习
Modern Bayesian optimization and adaptive sampling methods increasingly rely on nonlinear parametric models, yet theoretical guarantees for such models under adaptive data collection remain limited. Existing analyses largely focus on…
Discrete automated processes in industrial and cyber-physical systems often exhibit a repetitive structure in which successive repetitions follow a common trajectory while differing in duration, amplitude, and fine-scale dynamics. Such…
Artificial intelligence (AI) has transformed imaging inverse problems, from medical diagnostics to Earth observation. Yet deep neural networks can produce hallucinations, realistic-looking but incorrect details, undermining their…
This paper introduces an algorithm-agnostic approach to feature-based time series clustering via amortized neural inference. By training neural networks to approximate the optimal partitioning rule from simulated data, the proposed…
Determinantal point processes (DPPs) have emerged as a kernelized alternative to vanilla independent sampling for generating efficient minibatches, coresets and other parsimonious representations of large-scale datasets. While theoretical…
Density estimation in high-dimensional settings is an important and challenging statistical problem.Traditional methods based on kernel smoothing are inefficient in high dimensions due to the difficulties in specifying appropriate…
We propose Coreset-Induced Conditional Velocity Flow Matching (CCVFM), a generative model that augments hierarchical rectified flow with a data-informed source distribution. Hierarchical flow matching models the full conditional velocity…
LLM-enabled AI workflows increasingly produce outputs through iterative generate-evaluate-revise loops. Each iteration can improve the candidate, but it also creates a release decision: when to stop and output the current result? This…
Weak-to-strong (W2S) generalization, in which a strong model is fine-tuned on outputs of a weaker, task-specialized model, has been proposed as an approach to aligning superhuman AI systems. Existing theoretical analyses either fix the…
Experimental design has emerged as a powerful approach for improving the sample efficiency of A/B testing, yet existing designs rely critically on correctly specified models. We study robust sequential experimental design under model…
Open time-series forecasting (TSF) benchmarks cover retail, energy, weather, and traffic, but supply-chain logistics remains underserved. We introduce ISOMORPH, the first public digital twin of a multi-echelon logistics network with fully…
Length-dependent logit rescaling is widely used to stabilize long-context self-attention, but existing analyses and methods suggest conflicting inverse-temperature laws for the context length $n$, ranging from $(\log n)^{1/2}$ to $\log n$…
Conformal prediction provides a principled framework for uncertainty quantification with finite-sample coverage guarantees. While recent work has extended conformal prediction to online and sequential settings, existing methods typically…
Missing data imputation, where a model is trained on observed data to estimate unobserved values, is a fundamental problem in machine learning. In this paper, we rigorously formulate imputation model learning as a mean-squared error risk…
We study a data-dependent notion of diffusion-model generalization: when a model does not memorize the training set, where do its generated samples go relative to the geometry induced by the data? To answer this, we introduce a…
Generative foundation models trained on tokenized electronic health record (EHR) timelines show promise for clinical outcome prediction via Monte Carlo sampling of simulated future trajectories. However, this approach suffers from three…
We study supervised multiclass classification for diffusion processes, where each class is characterized by a distinct drift function and trajectories are observed at discrete times. We first derive a multidimensional Bayes rule and then…
In many business settings, task-specific labeled data are scarce or costly to obtain, limiting supervised learning on a target task. A classical response is transfer learning (TL). Many TL works study how to transfer information from…
Modern generative AI models, such as diffusion and flow matching models, can sample from rich data distributions. However, many applications, especially in science and engineering, require more than drawing samples from the model…
Uncertainty Quantification (UQ) is paramount for inference in engineering. A common inference task is to recover full-field information of physical systems from a small number of noisy observations, a usually highly ill-posed problem.…