Related papers: Approximating quantiles in very large datasets

An Experimental Study of Distributed Quantile Estimation

Quantiles are very important statistics information used to describe the distribution of datasets. Given the quantiles of a dataset, we can easily know the distribution of the dataset, which is a fundamental problem in data analysis.…

Databases · Computer Science 2015-08-25 Zixuan Zhuang

A Survey of Approximate Quantile Computation on Large-scale Data (Technical Report)

As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation…

Data Structures and Algorithms · Computer Science 2020-06-29 Zhiwei Chen, Aoqian Zhang

An efficient K-means algorithm for Massive Data

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm…

Machine Learning · Statistics 2016-05-11 Marco Capó, Aritz Pérez, José Antonio Lozano

Computation of extreme heat waves in climate models using a large deviation algorithm

Studying extreme events and how they evolve in a changing climate is one of the most important current scientific challenges. Starting from complex climate models, a key difficulty is to be able to run long enough simulations in order to…

Atmospheric and Oceanic Physics · Physics 2017-12-27 Francesco Ragone, Jeroen Wouters, Freddy Bouchet

Proximal algorithms for large-scale statistical modeling and sensor/actuator selection

Several problems in modeling and control of stochastically-driven dynamical systems can be cast as regularized semi-definite programs. We examine two such representative problems and show that they can be formulated in a similar manner. The…

Optimization and Control · Mathematics 2019-12-30 Armin Zare, Hesameddin Mohammadi, Neil K. Dhingra, Tryphon T. Georgiou, Mihailo R. Jovanović

Approximate Partition Selection for Big-Data Workloads using Summary Statistics

Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions.…

Databases · Computer Science 2020-08-25 Kexin Rong, Yao Lu, Peter Bailis, Srikanth Kandula, Philip Levis

A fast quantum mechanical algorithm for estimating the median

Consider the problem of estimating the median of N items to a precision epsilon, i.e., the estimate should be such that, with a high probability, the number of items, with values both smaller than and larger than this estimate, is less than…

Quantum Physics · Physics 2007-05-23 Lov K. Grover

Computation of extremes values of time averaged observables in climate models with large deviation techniques

One of the goals of climate science is to characterize the statistics of extreme and potentially dangerous events in the present and future climate. Extreme events like heat waves, droughts, or floods due to persisting rains are…

Atmospheric and Oceanic Physics · Physics 2020-01-08 Francesco Ragone, Freddy Bouchet

Calibrated Forecasts of Quasi-Periodic Climate Processes with Deep Echo State Networks and Penalized Quantile Regression

Among the most relevant processes in the Earth system for human habitability are quasi-periodic, ocean-driven multi-year events whose dynamics are currently incompletely characterized by physical models, and hence poorly predictable. This…

Atmospheric and Oceanic Physics · Physics 2023-08-09 Matthew Bonas, Christopher K. Wikle, Stefano Castruccio

Data Partitioning View of Mining Big Data

There are two main approximations of mining big data in memory. One is to partition a big dataset to several subsets, so as to mine each subset in memory. By this way, global patterns can be obtained by synthesizing all local patterns…

Databases · Computer Science 2016-11-30 Shichao Zhang

Adaptive tempering schedules with approximative intermediate measures for filtering problems

Data assimilation algorithms integrate prior information from numerical model simulations with observed data. Ensemble-based filters, regarded as state-of-the-art, are widely employed for large-scale estimation tasks in disciplines such as…

Numerical Analysis · Mathematics 2024-05-24 Iris Rammelmüller, Gottfried Hastermann, Jana de Wiljes

Accelerating data-driven algorithm selection for combinatorial partitioning problems

Data-driven algorithm selection is a powerful approach for choosing effective heuristics for computational problems. It operates by evaluating a set of candidate algorithms on a collection of representative training instances and selecting…

Machine Learning · Computer Science 2025-12-04 Vaggos Chatziafratis, Ishani Karmarkar, Yingxi Li, Ellen Vitercik

Collaborative Prediction: To Join or To Disjoin Datasets

With the recent rise of generative Artificial Intelligence (AI), the need of selecting high-quality dataset to improve machine learning models has garnered increasing attention. However, some part of this topic remains underexplored, even…

Machine Learning · Statistics 2025-06-16 Kyung Rok Kim, Yansong Wang, Xiaocheng Li, Guanting Chen

Principal Component Analysis and Higher Correlations for Distributed Data

We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We…

Data Structures and Algorithms · Computer Science 2014-07-01 Ravindran Kannan, Santosh Vempala, David Woodruff

Approximation algorithms for stochastic clustering

We consider stochastic settings for clustering, and develop provably-good approximation algorithms for a number of these notions. These algorithms yield better approximation ratios compared to the usual deterministic clustering setting.…

Data Structures and Algorithms · Computer Science 2023-10-13 David G. Harris, Shi Li, Thomas Pensyl, Aravind Srinivasan, Khoa Trinh

A sampling-based approach for efficient clustering in large datasets

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our…

Machine Learning · Computer Science 2022-03-30 Georgios Exarchakis, Omar Oubari, Gregor Lenz

Visualization of Big Spatial Data using Coresets for Kernel Density Estimates

The size of large, geo-located datasets has reached scales where visualization of all data points is inefficient. Random sampling is a method to reduce the size of a dataset, yet it can introduce unwanted errors. We describe a method for…

Human-Computer Interaction · Computer Science 2017-09-14 Yan Zheng, Yi Ou, Alexander Lex, Jeff M. Phillips

Towards Machine Wald

The past century has seen a steady increase in the need of estimating and predicting complex systems and making (possibly critical) decisions with limited information. Although computers have made possible the numerical evaluation of…

Statistics Theory · Mathematics 2017-01-13 Houman Owhadi, Clint Scovel

Hard-Constrained Deep Learning for Climate Downscaling

The availability of reliable, high-resolution climate and weather data is important to inform long-term decisions on climate adaptation and mitigation and to guide rapid responses to extreme events. Forecasting models are limited by…

Atmospheric and Oceanic Physics · Physics 2024-03-04 Paula Harder, Alex Hernandez-Garcia, Venkatesh Ramesh, Qidong Yang, Prasanna Sattigeri, Daniela Szwarcman, Campbell Watson, David Rolnick

Data-driven multiscale modeling for correcting dynamical systems

We propose a multiscale approach for predicting quantities in dynamical systems which is explicitly structured to extract information in both fine-to-coarse and coarse-to-fine directions. We envision this method being generally applicable…

Atmospheric and Oceanic Physics · Physics 2025-12-30 Karl Otness, Laure Zanna, Joan Bruna