Related papers: A Random Sample Partition Data Model for Big Data …

Approximate Partition Selection for Big-Data Workloads using Summary Statistics

Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions.…

Databases · Computer Science 2020-08-25 Kexin Rong, Yao Lu, Peter Bailis, Srikanth Kandula, Philip Levis

Robust, scalable and fast bootstrap method for analyzing large scale data

In this paper we address the problem of performing statistical inference for large-scale data sets, i.e., Big Data. The volume and dimensionality of the data may be so high that it cannot be processed or stored in a single computing node. We…

Methodology · Statistics 2016-04-20 Shahab Basiri, Esa Ollila, Visa Koivunen

Distributed Spatial Data Clustering as a New Approach for Big Data Analysis

In this paper we propose a new approach for Big Data mining and analysis. This new approach works well on distributed datasets and deals with the data clustering task of the analysis. The approach consists of two main phases, the first phase…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-03-05 Malika Bendechache, Nhien-An Le-Khac, M-Tahar Kechadi

SPlit: An Optimal Method for Data Splitting

In this article we propose an optimal method referred to as SPlit for splitting a dataset into training and testing sets. SPlit is based on the method of Support Points (SP), which was initially developed for finding the optimal…

Machine Learning · Statistics 2021-05-10 V. Roshan Joseph, Akhil Vakayil

Consistent Estimation for Partition-wise Regression and Classification Models

Partition-wise models offer a flexible approach for modeling complex and multidimensional data that are capable of producing interpretable results. They are based on partitioning the observed data into regions, each of which is modeled with…

Methodology · Statistics 2017-06-07 Rex C. Y. Cheung, Alexander Aue, Thomas C. M. Lee

Randomized Block Proximal Methods for Distributed Stochastic Big-Data Optimization

In this paper we introduce a class of novel distributed algorithms for solving stochastic big-data convex optimization problems over directed graphs. In the addressed set-up, the dimension of the decision variable can be extremely high and…

Optimization and Control · Mathematics 2020-10-06 Francesco Farina, Giuseppe Notarstefano

Random Partitioning and Distribution-based Thresholding for Iterative Variable Screening in High Dimensions

In big data analysis, a simple task such as linear regression can become very challenging as the variable dimension $p$ grows. As a result, variable screening is inevitable in many scientific studies. In recent years, randomized algorithms…

Methodology · Statistics 2019-02-13 Yu-Hsiang Cheng, Tzee-Ming Huang, Su-Yun Huang

Data Partitioning View of Mining Big Data

There are two main approximations of mining big data in memory. One is to partition a big dataset into several subsets, so as to mine each subset in memory. In this way, global patterns can be obtained by synthesizing all local patterns…

Databases · Computer Science 2016-11-30 Shichao Zhang

Sampled sub-block hashing for large input randomness extraction

Randomness extraction is an essential post-processing step in practical quantum cryptography systems. When statistical fluctuations are taken into consideration, the requirement of large input data size could heavily penalise the speed and…

Quantum Physics · Physics 2024-04-09 Hong Jie Ng, Wen Yu Kon, Ignatius William Primaatmaja, Chao Wang, Charles Lim

Randomized maximum-contrast selection: subagging for large-scale regression

We introduce a very general method for sparse and large-scale variable selection. The large-scale regression setting is such that both the number of parameters and the number of samples are extremely large. The proposed method is based on…

Statistics Theory · Mathematics 2019-07-31 Jelena Bradic

Big Data and model-based survey sampling

Big Data are huge amounts of digital information that are automatically accrued or merged from several sources and rarely result from properly planned surveys. A Big Dataset is herein conceived of as a collection of information concerning a…

Computation · Statistics 2020-02-12 Deldossi Laura, Tommasi Chiara

Visualization of Big Spatial Data using Coresets for Kernel Density Estimates

The size of large, geo-located datasets has reached scales where visualization of all data points is inefficient. Random sampling is a method to reduce the size of a dataset, yet it can introduce unwanted errors. We describe a method for…

Human-Computer Interaction · Computer Science 2017-09-14 Yan Zheng, Yi Ou, Alexander Lex, Jeff M. Phillips

Score-Matching Representative Approach for Big Data Analysis with Generalized Linear Models

We propose a fast and efficient strategy, called the representative approach, for big data analysis with generalized linear models, especially for distributed data with localization requirements or limited network bandwidth. With a given…

Methodology · Statistics 2021-12-16 Keren Li, Jie Yang

A model robust sub-sampling approach for Generalised Linear Models in Big data settings

In today's modern era of Big data, computationally efficient and scalable methods are needed to support timely insights and informed decision making. One such method is sub-sampling, where a subset of the Big data is analysed and used as…

Methodology · Statistics 2022-09-07 Amalan Mahendran, Helen Thompson, James M. McGree

A Random Finite Set Model for Data Clustering

The goal of data clustering is to partition data points into groups to minimize a given objective function. While most existing clustering algorithms treat each data point as a vector, in many applications each datum is not a vector but a…

Machine Learning · Statistics 2017-03-16 Dinh Phung, Ba-Ngu Vo

Clustering in Partially Labeled Stochastic Block Models via Total Variation Minimization

A main task in data analysis is to organize data points into coherent groups or clusters. The stochastic block model is a probabilistic model for the cluster structure. This model prescribes different probabilities for the presence of edges…

Machine Learning · Computer Science 2020-09-24 Alexander Jung

Spatial Random Sampling: A Structure-Preserving Data Sketching Tool

Random column sampling is not guaranteed to yield data sketches that preserve the underlying structures of the data and may not sample sufficiently from less-populated data clusters. Also, adaptive sampling can often provide accurate low…

Machine Learning · Computer Science 2017-10-11 Mostafa Rahmani, George Atia

Rectangular Bounding Process

Stochastic partition models divide a multi-dimensional space into a number of rectangular regions, such that the data within each region exhibit certain types of homogeneity. Due to the nature of their partition strategy, existing partition…

Machine Learning · Statistics 2019-03-12 Xuhui Fan, Bin Li, Scott Anthony Sisson

Random Partition Models for Microclustering Tasks

Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as…

Methodology · Statistics 2020-04-07 Brenda Betancourt, Giacomo Zanella, Rebecca C. Steorts

A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified

Subsampling is a computationally efficient and scalable method to draw inference in large data settings based on a subset of the data rather than needing to consider the whole dataset. When employing subsampling techniques, a crucial…

Methodology · Statistics 2025-10-08 Amalan Mahendran, Helen Thompson, James M. McGree