Statistics
We introduce Pairwise Distance-Diffusion Analysis (PDDA), a geometric framework for estimating the Hurst exponent from distance plots of long-memory stochastic processes. A single construction yields two complementary routes: R/S-PDDA, a…
We develop an adaptive jump test for discretely observed high-frequency semimartingales by combining the A"it-Sahalia--Jacod ratio statistic (A"it-Sahalia and Jacod, 2009) and the Lee--Mykland extreme-return statistic (Lee and Mykland,…
Physical computing systems provide a promising route toward hardware-native machine learning, but their computational capabilities remain difficult to characterize in a principled, task-independent, and data-efficient way. We extend the…
Informative cluster size (ICS) and informative subgroup size (ISS) can distort marginal association estimates when the number of observed units, or their distribution across outcome-defined categories, is related to the outcomes under…
Data assimilation is the process of estimating the state of a dynamical system over time by combining model predictions with measurements. This task becomes challenging when the system is nonlinear and high-dimensional. To address this,…
We develop Wasserstein-based hypothesis tests for empirical-measure convergence in stationary dependent sequences. For a known candidate invariant measure, $\mu$, we study the statistic $T_n=\sqrt{n}\,W_1(\hat\mu_n,\mu)$ and establish…
Modern approaches for learning from non-Markovian time series, such as recurrent neural networks, neural controlled differential equations or transformers, typically rely on implicit memory mechanisms that can be difficult to interpret or…
p-hacking occurs when researchers conduct multiple significance tests (e.g., p1;H0,1 and p2;H0,2) and then selectively report tests that yield desirable (usually significant) results (e.g., p2 < 0.05;H0,2) without correcting for multiple…
The fate of cities under natural hazards depends not only on hazard intensity but also on the coupling of structural damage, a collective process that remains poorly understood. Here we show that urban structural damage exhibits…
Score-based generative models have recently achieved remarkable success. While they are usually parameterized by the score, an alternative way is to use a series of time-dependent energy-based models (EBMs), where the score is obtained from…
We introduce a formal active learning methodology for guiding the placement of Lagrangian observers to infer time-dependent vector fields -- a key task in oceanography, marine science, and ocean engineering -- using a physics-informed…
Accurate forecasts of weekly mortality are essential for public health and the insurance industry. We develop a forecasting framework that extends the Lee-Carter model with age- and region-specific seasonal effects and penalized distributed…
A popular quantitative approach to evaluating player performance in sports involves comparing an observed outcome to the expected outcome ignoring player involvement, which is estimated using statistical or machine learning methods. In…
Multidimensional factor models with moderations on all model parameters have so far been limited to single-factor and two-factor models. This does not align well with existing psychological measures, which are commonly intended to assess…
Conformalized multiple testing offers a model-free way to control predictive uncertainty in decision-making. Existing methods typically use only part of the available data to build score functions tailored to specific settings. We propose a…
Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where…
Reinforced random walks (RRWs), including vertex-reinforced random walks (VRRWs) and edge-reinforced random walks (ERRWs), model random walks where the transition probabilities evolve based on prior visitation history~\cite{mgr, fmk,…
Open burning of plastic waste may pose a significant threat to global health by degrading air quality, but quantitative research on this problem -- crucial for policy making -- has been stunted by lack of data. Many low- and middle-income…
We study estimation of a class prior for unlabeled target samples which possibly differs from that of source population. Moreover, it is assumed that the source data is partially observable: only samples from the positive class and from the…
We establish novel and general high-dimensional concentration inequalities and Berry-Esseen bounds for vector-valued martingales induced by Markov chains. We apply these results to analyze the performance of the Temporal Difference (TD)…