Related papers: On Robust Aggregation for Distributed Data

A Massive Data Framework for M-Estimators with Cubic-Rate

The divide and conquer method is a common strategy for handling massive data. In this article, we study the divide and conquer method for cubic-rate estimators under the massive data framework. We develop a general theory for establishing…

Statistics Theory · Mathematics 2017-04-06 Chengchun Shi, Wenbin Lu, Rui Song

Distributed Statistical Inference for Massive Data

This paper considers distributed statistical inference for general symmetric statistics that encompass the U-statistics and the M-estimators in the context of massive data where the data can be stored at multiple platforms in different…

Statistics Theory · Mathematics 2018-05-30 Song Xi Chen, Liuhua Peng

Universal Robust Regression via Maximum Mean Discrepancy

Many modern datasets are collected automatically and are thus easily contaminated by outliers. This has led to renewed interest in robust estimation, including new notions of robustness such as robustness to adversarial contamination of the…

Statistics Theory · Mathematics 2023-05-05 Pierre Alquier, Mathieu Gerber

Hierarchical Aggregation Approach for Distributed clustering of spatial datasets

In this paper, we present a new approach to distributed clustering of spatial datasets, based on an innovative and efficient aggregation technique. This distributed approach consists of two phases: 1) local clustering phase, where each…

Databases · Computer Science 2018-02-05 Malika Bendechache, Nhien-An Le-Khac, M-Tahar Kechadi

Bootstrap Model Aggregation for Distributed Statistical Learning

In distributed, or privacy-preserving learning, we are often given a set of probabilistic models estimated from different local repositories, and asked to combine them into a single model that gives efficient statistical estimation. A…

Machine Learning · Statistics 2017-03-01 Jun Han, Qiang Liu

Distributed Adaptive Huber Regression

Distributed data naturally arise in scenarios involving multiple sources of observations, each stored at a different location. Directly pooling all the data together is often prohibited due to limited bandwidth and storage, or due to…

Methodology · Statistics 2021-07-07 Jiyu Luo, Qiang Sun, Wenxin Zhou

A Unified Approach to Robust Mean Estimation

In this paper, we develop connections between two seemingly disparate, but central, models in robust statistics: Huber's epsilon-contamination model and the heavy-tailed noise model. We provide conditions under which this connection…

Machine Learning · Statistics 2019-07-03 Adarsh Prasad, Sivaraman Balakrishnan, Pradeep Ravikumar

Heterogeneity-aware and communication-efficient distributed statistical inference

In multicenter research, individual-level data are often protected against sharing across sites. To overcome the barrier of data sharing, many distributed algorithms, which only require sharing aggregated information, have been developed.…

Methodology · Statistics 2021-03-25 Rui Duan, Yang Ning, Yong Chen

Dealing with bad apples: Robust range-based network localization via distributed relaxation methods

Real-world network applications must cope with failing nodes, malicious attacks, or nodes that somehow face corrupted data, classified as outliers. One enabling application is the geographic localization of the network nodes. However,…

Optimization and Control · Mathematics 2016-10-31 Cláudia Soares, João Gomes

Variance-based Clustering Technique for Distributed Data Mining Applications

Nowadays, huge amounts of data are naturally collected at distributed sites for various reasons, and moving these data through the network to extract useful knowledge is almost infeasible for either technical or policy reasons.…

Databases · Computer Science 2017-03-30 Lamine M. Aouad, Nhien-An Le-Khac, Tahar Kechadi

Statistical inference using Regularized M-estimation in the reproducing kernel Hilbert space for handling missing data

Imputation and propensity score weighting are two popular techniques for handling missing data. We address these problems using the regularized M-estimation techniques in the reproducing kernel Hilbert space. Specifically, we first use the…

Methodology · Statistics 2021-07-16 Hengfang Wang, Jae Kwang Kim

A General Decision Theory for Huber's $\epsilon$-Contamination Model

Today's data pose unprecedented challenges to statisticians. They may be incomplete, corrupted or exposed to some unknown source of contamination. We need new methods and theories to grapple with these challenges. Robust estimation is one of…

Statistics Theory · Mathematics 2017-01-17 Mengjie Chen, Chao Gao, Zhao Ren

Tk-merge: Computationally Efficient Robust Clustering Under General Assumptions

We address general-shaped clustering problems under very weak parametric assumptions with a two-step hybrid robust clustering algorithm based on trimmed k-means and hierarchical agglomeration. The algorithm has low computational complexity…

Methodology · Statistics 2022-01-19 Luca Insolia, Domenico Perrotta

Spectra: Robust Estimation of Distribution Functions in Networks

Distributed aggregation allows the derivation of a given global aggregate property from many individual local values in nodes of an interconnected network system. Simple aggregates such as minima/maxima, counts, sums and averages have been…

Distributed, Parallel, and Cluster Computing · Computer Science 2012-04-09 Miguel Borges, Paulo Jesus, Carlos Baquero, Paulo Sérgio Almeida

On a Distributed Approach for Density-based Clustering

Efficient extraction of useful knowledge from these data is still a challenge, especially when the data are distributed, heterogeneous and of varying quality depending on the corresponding local infrastructure. To reduce the overhead cost,…

Databases · Computer Science 2017-04-17 Nhien-An Le-Khac, M-Tahar Kechadi

Efficient Large Scale Clustering based on Data Partitioning

Clustering techniques are very attractive for extracting and identifying patterns in datasets. However, their application to very large spatial datasets presents numerous challenges such as high-dimensional data, heterogeneity, and high…

Databases · Computer Science 2018-02-27 Malika Bendechache, Nhien-An Le-Khac, M-Tahar Kechadi

Optimal Robust Estimation under Local and Global Corruptions: Stronger Adversary and Smaller Error

Algorithmic robust statistics has traditionally focused on the contamination model where a small fraction of the samples are arbitrarily corrupted. We consider a recent contamination model that combines two kinds of corruptions: (i) small…

Data Structures and Algorithms · Computer Science 2024-10-23 Thanasis Pittas, Ankit Pensia

Distributed estimation of spiked eigenvalues in spiked population models

The proliferation of science and technology has led to the prevalence of voluminous data sets that are distributed across multiple machines. It is an established fact that conventional statistical methodologies may be unfeasible in the…

Statistics Theory · Mathematics 2023-10-24 Lu Yan, Jiang Hu

Offline Change Detection under Contamination

In this work, we propose a non-parametric and robust change detection algorithm to detect multiple change points in time series data under contamination. The contamination model is sufficiently general in that the most common model used…

Methodology · Statistics 2022-06-24 Sujay Bhatt, Guanhua Fang, Ping Li

Clustering Large Data Sets with Incremental Estimation of Low-density Separating Hyperplanes

An efficient method for obtaining low-density hyperplane separators in the unsupervised context is proposed. Low density separators can be used to obtain a partition of a set of data based on their allocations to the different sides of the…

Machine Learning · Statistics 2021-08-10 David P. Hofmeyr