Related papers: A sub-sampling algorithm preventing outliers

Meta-Learning for Unsupervised Outlier Detection with Optimal Transport

Automated machine learning has been widely researched and adopted in the field of supervised classification and regression, but progress in unsupervised settings has been limited. We propose a novel approach to automate outlier detection…

Machine Learning · Computer Science 2024-09-10 Prabhant Singh, Joaquin Vanschoren

D-optimal Subsampling Design for Massive Data Linear Regression

Data reduction is a fundamental challenge of modern technology, where classical statistical methods are not applicable because of computational limitations. We consider multiple linear regression for an extraordinarily large number of…

Methodology · Statistics 2025-05-30 Torsten Glemser, Rainer Schwabe

Optimal subsampling designs

Subsampling is commonly used to overcome computational and economical bottlenecks in the analysis of finite populations and massive datasets. Existing methods are often limited in scope and use optimality criteria (e.g., A-optimality) with…

Statistics Theory · Mathematics 2023-04-07 Henrik Imberg, Marina Axelson-Fisk, Johan Jonasson

Balanced Subsampling for Big Data with Categorical Covariates

Supervised learning under measurement constraints is a common challenge in statistical and machine learning. In many applications, despite extensive design points, acquiring responses for all points is often impractical due to resource…

Methodology · Statistics 2025-03-19 Lin Wang

On the selection of optimal subdata for big data regression based on leverage scores

The demand for computational resources in the modeling process grows with the scale of the datasets, since traditional approaches for regression involve inverting huge data matrices. The main problem lies in the large data size,…

Methodology · Statistics 2023-07-06 Vasilis Chasiotis, Dimitris Karlis

Linear-time Outlier Detection via Sensitivity

Outliers are ubiquitous in modern data sets. Distance-based techniques are a popular non-parametric approach to outlier detection as they require no prior assumptions on the data generating distribution and are simple to implement. Scaling…

Machine Learning · Statistics 2016-05-04 Mario Lucic, Olivier Bachem, Andreas Krause

Optimal Sub-sampling with Influence Functions

Sub-sampling is a common and often effective method to deal with the computational challenges of large datasets. However, for most statistical models, there is no well-motivated approach for drawing a non-uniform subsample. We show that the…

Machine Learning · Statistics 2017-09-07 Daniel Ting, Eric Brochu

Optimal subdata selection for linear model selection

If the assumed model does not accurately capture the underlying structure of the data, a statistical method is likely to yield sub-optimal results, and so model selection is crucial in order to conduct any statistical analysis. However, in…

Methodology · Statistics 2023-06-21 Vasilis Chasiotis, Dimitris Karlis

Subdata selection for big data regression: an improved approach

In the big data era researchers face a series of problems. Even standard approaches/methodologies, like linear regression, can be difficult or problematic with huge volumes of data. Traditional approaches for regression in big datasets may…

Methodology · Statistics 2024-11-13 Vasilis Chasiotis, Dimitris Karlis

A Soft Method for Outliers Detection at the Edge of the Network

The combination of the Internet of Things and Edge Computing offers many opportunities to support innovative applications close to end users. Numerous devices present in both infrastructures can collect data upon which various processing…

Distributed, Parallel, and Cluster Computing · Computer Science 2021-03-02 Kostas Kolomvatsos, Christos Anagnostopoulos

Rethinking Unsupervised Outlier Detection via Multiple Thresholding

In the realm of unsupervised image outlier detection, assigning outlier scores holds greater significance than its subsequent task: thresholding for predicting labels. This is because determining the optimal threshold on non-separable…

Computer Vision and Pattern Recognition · Computer Science 2024-07-16 Zhonghang Liu, Panzhong Lu, Guoyang Xie, Zhichao Lu, Wen-Yan Lin

A Randomized Exchange Algorithm for Computing Optimal Approximate Designs of Experiments

We propose a class of subspace ascent methods for computing optimal approximate designs that covers both existing as well as new and more efficient algorithms. Within this class of methods, we construct a simple, randomized exchange…

Computation · Statistics 2018-01-18 Radoslav Harman, Lenka Filová, Peter Richtárik

Parameter Selection Algorithm For Continuous Variables

In this article, we propose a new algorithm for supervised learning methods, by which one can both capture the non-linearity in data and also find the best subset model. To produce an enhanced subset of the original variables, an ideal…

Applications · Statistics 2017-01-23 Peyman Tavallali, Marianne Razavi, Sean Brady

Changepoint Detection in the Presence of Outliers

Many traditional methods for identifying changepoints can struggle in the presence of outliers, or when the noise is heavy-tailed. Often they will infer additional changepoints in order to fit the outliers. To overcome this problem, data…

Methodology · Statistics 2017-07-12 Paul Fearnhead, Guillem Rigaill

Random Subspace Learning Approach to High-Dimensional Outliers Detection

We introduce and develop a novel approach to outlier detection based on adaptation of random subspace learning. Our proposed method handles both high-dimension low-sample size and traditional low-dimensional high-sample size datasets.…

Machine Learning · Statistics 2015-05-05 Bohan Liu, Ernest Fokoue

Unsupervised Data Selection for Supervised Learning

Recent research has put great effort into the development of deep learning architectures and optimizers, obtaining impressive results in areas ranging from vision to language processing. However, little attention has been paid to the need of a…

Computer Vision and Pattern Recognition · Computer Science 2018-12-20 Gabriele Valvano, Andrea Leo, Daniele Della Latta, Nicola Martini, Gianmarco Santini, Dante Chiappino, Emiliano Ricciardi

Constrained D-optimal Design for Paid Research Study

We consider constrained sampling problems in paid research studies or clinical trials. When there are more qualified volunteers than the budget allows, we recommend a D-optimal sampling strategy based on the optimal design theory and develop a…

Methodology · Statistics 2024-05-27 Yifei Huang, Liping Tong, Jie Yang

Detection of Multiple Influential Observations on Model Selection

Outlying observations are frequently encountered across a wide spectrum of scientific domains, posing notable challenges to the generalizability of statistical models and the reproducibility of downstream analysis. They are identified…

Methodology · Statistics 2026-03-17 Dongliang Zhang, Masoud Asgharian, Martin A. Lindquist

Unsupervised Outlier Detection using Random Subspace and Subsampling Ensembles of Dirichlet Process Mixtures

Probabilistic mixture models are recognized as effective tools for unsupervised outlier detection owing to their interpretability and global characteristics. Among these, Dirichlet process mixture models stand out as a strong alternative to…

Machine Learning · Computer Science 2024-07-26 Dongwook Kim, Juyeon Park, Hee Cheol Chung, Seonghyun Jeong

Practical Bayesian optimization in the presence of outliers

Inference in the presence of outliers is an important field of research as outliers are ubiquitous and may arise across a variety of problems and domains. Bayesian optimization is a method that heavily relies on probabilistic inference. This…

Machine Learning · Computer Science 2017-12-14 Ruben Martinez-Cantin, Kevin Tee, Michael McCourt