Applied Statistics
One of the first fully quantitative distance matrix visualization methods was proposed by Jan Czekanowski at the beginning of the previous century. Recently, a software package, RMaCzek, was made available that allows for producing such…
Analyses of spectral data often assume a linear mixing hypothesis, which states that the spectrum of a mixed substance is approximately the mixture of the individual spectra of its constituent parts. We evaluate this hypothesis in the…
The time-series data of sea level rise and fall contains crucial information on the variability of sea level patterns. Traditional $k$-means clustering is commonly used for categorizing regional variability of sea level; however, its…
Multi-arm randomization is increasingly widely applied, and it is crucial to ensure that the distributions of important observed covariates, as well as the potential unobserved covariates, are similar and comparable…
Climate change communication is crucial to raising awareness and motivating action. In the context of breaching the limits set out by the Paris Agreement, we argue that climate scientists should move away from point estimates and towards…
Tree-based learning methods such as Random Forest and XGBoost are still the gold-standard prediction methods for tabular data. Feature importance measures are usually considered for feature selection as well as to assess the effect of…
The assumption of fingerprint uniqueness is foundational in forensic science and central to criminal identification practices. However, empirical evidence supporting this assumption is limited, and recent findings from artificial…
Detection of abrupt spatial changes in physical properties representing unique geometric features such as buried objects, cavities, and fractures is an important problem in geophysics and many engineering disciplines. In this context,…
Accurate and timely assessment of wind speed and energy output allows efficient planning and management of this resource on the power grid. Wind energy, especially at high resolution, calls for the development of nonlinear statistical…
Flexible machine learning tools are increasingly used to estimate heterogeneous treatment effects. This paper gives an accessible tutorial demonstrating the use of the causal forest algorithm, available in the R package grf. We start with a…
Over the last few years, there has been a growing interest in the prediction and modelling of competitive sports outcomes, with particular emphasis placed on this area by the Bayesian statistics and machine learning communities. In this…
We compare conversion rates of association football (soccer) penalties during regulation or extra time with those during shootouts. Our data consists of roughly 50,000 penalties from the eleven most recent seasons in European men's football…
Understanding the oscillating behaviors that govern organisms' internal biological processes requires interdisciplinary efforts combining both biological and computer experiments, as the latter can complement the former by simulating…
The advent of artificial intelligence (AI) technologies has significantly changed many domains, including applied statistics. This review and vision paper explores the evolving role of applied statistics in the AI era, drawing from our…
Principal Component Analysis (PCA) is one of the most widely used tools for extracting low-dimensional representations of data, in particular for time series. Its performance is known to depend strongly on the quality (amount of noise) and the…
Nowadays, weather forecasts are commonly generated by ensemble forecasts based on multiple runs of numerical weather prediction models. However, such forecasts are usually miscalibrated and/or biased, and thus require statistical…
Many existing approaches to generalizing statistical inference amidst distribution shift operate under the covariate shift assumption, which posits that the conditional distribution of unobserved variables given observable ones is invariant…
Honour-based abuse covers a wide range of family abuse including female genital mutilation and forced marriage. Safeguarding professionals need to identify where abuses are happening in their local community to best support those at risk of…
FDA's Project Optimus initiative for oncology drug development emphasizes selecting a dose that optimizes both efficacy and safety. When an inferentially adaptive Phase 2/3 design with dose selection is implemented to comply with the…
Biclustering has gained interest in gene expression data analysis due to its ability to identify groups of samples that exhibit similar behaviour in specific subsets of genes (or vice versa), in contrast to traditional clustering methods…