Related papers: Approximate Partition Selection for Big-Data Workl…

A Random Sample Partition Data Model for Big Data Analysis

Big data sets must be carefully partitioned into statistically similar data subsets that can be used as representative samples for big data analysis tasks. In this paper, we propose the random sample partition (RSP) data model to represent…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-06-11 Salman Salloum, Yulin He, Joshua Zhexue Huang, Xiaoliang Zhang, Tamer Z. Emara, Chenghao Wei, Heping He

A Survey of Approximate Quantile Computation on Large-scale Data (Technical Report)

As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to compute because it is neither mergeable nor incremental. Thus, the limitation…

Data Structures and Algorithms · Computer Science 2020-06-29 Zhiwei Chen, Aoqian Zhang

Storyboard: Optimizing Precomputed Summaries for Aggregation

An emerging class of data systems partition their data and precompute approximate summaries (i.e., sketches and samples) for each segment to reduce query costs. They can then aggregate and combine the segment summaries to estimate results…

Databases · Computer Science 2020-02-11 Edward Gan, Peter Bailis, Moses Charikar

Combining Aggregation and Sampling (Nearly) Optimally for Approximate Query Processing

Sample-based approximate query processing (AQP) suffers from many pitfalls such as the inability to answer very selective queries and unreliable confidence intervals when sample sizes are small. Recent research presented an intriguing…

Databases · Computer Science 2021-03-31 Xi Liang, Stavros Sintos, Zechao Shang, Sanjay Krishnan

Data Partitioning View of Mining Big Data

There are two main approximations of mining big data in memory. One is to partition a big dataset into several subsets, so as to mine each subset in memory. In this way, global patterns can be obtained by synthesizing all local patterns…

Databases · Computer Science 2016-11-30 Shichao Zhang

StruClus: Structural Clustering of Large-Scale Graph Databases

We present a structural clustering algorithm for large-scale datasets of small labeled graphs, utilizing a frequent subgraph sampling strategy. A set of representatives provides an intuitive description of each cluster, supports the…

Databases · Computer Science 2016-10-03 Till Schäfer, Petra Mutzel

Approximate Computation for Big Data Analytics

Over the past few years, research and development have made significant progress on big data analytics. A fundamental issue for big data analytics is efficiency. If the optimal solution is unable to attain or not required or has a…

Databases · Computer Science 2019-01-03 Shuai Ma, Jinpeng Huai

A sampling-based approach for efficient clustering in large datasets

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high performance by evaluating distances of data points with a subset of the cluster centres. Our…

Machine Learning · Computer Science 2022-03-30 Georgios Exarchakis, Omar Oubari, Gregor Lenz

On the variance of subset sum estimation

For high volume data streams and large data warehouses, sampling is used for efficient approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing with a set of weighted items and want to support queries…

Data Structures and Algorithms · Computer Science 2007-05-23 Mario Szegedy, Mikkel Thorup

Approximate Queries and Representations for Large Data Sequences

Many new database application domains such as experimental sciences and medicine are characterized by large sequences as their main form of data. Using approximate representation can significantly reduce the required storage and search…

Databases · Computer Science 2019-04-22 Hagit Shatkay, Stanley B. Zdonik

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources

Modern statistical analysis often encounters datasets with large sizes. For these datasets, conventional estimation methods can hardly be used immediately because practitioners often suffer from limited computational resources. In most…

Methodology · Statistics 2023-04-14 Shuyuan Wu, Xuening Zhu, Hansheng Wang

Approximate Query Processing over Static Sets and Sliding Windows

Indexing of static and dynamic sets is fundamental to a large set of applications such as information retrieval and caching. Denoting the characteristic vector of the set by B, we consider the problem of encoding sets and multisets to…

Data Structures and Algorithms · Computer Science 2018-09-17 Ran Ben Basat, Seungbum Jo, Srinivasa Rao Satti, Shubham Ugare

Approximating quantiles in very large datasets

Very large datasets are often encountered in climatology, either from a multiplicity of observations over time and space or outputs from deterministic models (sometimes in petabytes = 1 million gigabytes). Loading a large data vector and…

Computation · Statistics 2010-07-08 Reza Hosseini

Query Processing on Large Graphs: Approaches To Scalability and Response Time Trade Offs

With the advent of social networks and the web, graph sizes have grown too large to fit in main memory, precipitating the need for alternative approaches for an efficient, scalable evaluation of queries on graphs of any size. Here, we…

Databases · Computer Science 2019-05-15 Soumyava Das, Abhishek Santra, Jay Bodra, Sharma Chakravarthy

Approximate Cluster-Based Sparse Document Retrieval with Segmented Maximum Term Weights

This paper revisits cluster-based retrieval that partitions the inverted index into multiple groups and skips the index partially at cluster and document levels during online inference using a learned sparse representation. It proposes an…

Information Retrieval · Computer Science 2024-04-16 Yifan Qiao, Shanxiu He, Yingrui Yang, Parker Carlson, Tao Yang

Multi-resolution subsampling for large-scale linear classification

Subsampling is one of the popular methods to balance statistical efficiency and computational efficiency in the big data era. Most approaches aim at selecting informative or representative sample points to achieve good overall information…

Methodology · Statistics 2024-07-10 Haolin Chen, Holger Dette, Jun Yu

An Experimental Study of Distributed Quantile Estimation

Quantiles are important statistics used to describe the distribution of datasets. Given the quantiles of a dataset, we can easily know the distribution of the dataset, which is a fundamental problem in data analysis.…

Databases · Computer Science 2015-08-25 Zixuan Zhuang

Diversity Subsampling: Custom Subsamples from Large Data Sets

Subsampling from a large data set is useful in many supervised learning contexts to provide a global view of the data based on only a fraction of the observations. Diverse (or space-filling) subsampling is an appealing subsampling approach…

Methodology · Statistics 2023-11-27 Boyang Shang, Daniel W. Apley, Sanjay Mehrotra

Histogram Sort with Sampling

To minimize data movement, state-of-the-art parallel sorting algorithms use techniques based on sampling and histogramming to partition keys prior to redistribution. Sampling enables partitioning to be done using a representative subset of…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-04-30 Vipul Harsh, Laxmikant Kale, Edgar Solomonik

STULL: Unbiased Online Sampling for Visual Exploration of Large Spatiotemporal Data

Online sampling-supported visual analytics is increasingly important, as it allows users to explore large datasets with acceptable approximate answers at interactive rates. However, existing online spatiotemporal sampling techniques are…

Databases · Computer Science 2020-09-01 Guizhen Wang, Jingjing Guo, Mingjie Tang, José Florencio de Queiroz Neto, Calvin Yau, Anas Daghistani, Morteza Karimzadeh, Walid G. Aref, David S. Ebert