Related papers: Approximating quantiles in very large datasets

200 papers

Quantiles are very important statistics information used to describe the distribution of datasets. Given the quantiles of a dataset, we can easily know the distribution of the dataset, which is a fundamental problem in data analysis.…

Databases · Computer Science 2015-08-25 Zixuan Zhuang

As data volume grows extensively, data profiling helps to extract metadata of large-scale data. However, one kind of metadata, order statistics, is difficult to be computed because they are not mergeable or incremental. Thus, the limitation…

Data Structures and Algorithms · Computer Science 2020-06-29 Zhiwei Chen, Aoqian Zhang

Due to the progressive growth of the amount of data available in a wide variety of scientific fields, it has become more difficult to manipulate and analyze such information. Even though datasets have grown in size, the K-means algorithm…

Machine Learning · Statistics 2016-05-11 Marco Capó, Aritz Pérez, José Antonio Lozano

Studying extreme events and how they evolve in a changing climate is one of the most important current scientific challenges. Starting from complex climate models, a key difficulty is to be able to run long enough simulations in order to…

Atmospheric and Oceanic Physics · Physics 2017-12-27 Francesco Ragone, Jeroen Wouters, Freddy Bouchet

Several problems in modeling and control of stochastically-driven dynamical systems can be cast as regularized semi-definite programs. We examine two such representative problems and show that they can be formulated in a similar manner. The…

Optimization and Control · Mathematics 2019-12-30 Armin Zare, Hesameddin Mohammadi, Neil K. Dhingra, Tryphon T. Georgiou, Mihailo R. Jovanović

Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions.…

Databases · Computer Science 2020-08-25 Kexin Rong, Yao Lu, Peter Bailis, Srikanth Kandula, Philip Levis

Consider the problem of estimating the median of N items to a precision epsilon, i.e., the estimate should be such that, with a high probability, the number of items, with values both smaller than and larger than this estimate, is less than…

Quantum Physics · Physics 2007-05-23 Lov K. Grover

One of the goals of climate science is to characterize the statistics of extreme and potentially dangerous events in the present and future climate. Extreme events like heat waves, droughts, or floods due to persisting rains are…

Atmospheric and Oceanic Physics · Physics 2020-01-08 Francesco Ragone, Freddy Bouchet

Among the most relevant processes in the Earth system for human habitability are quasi-periodic, ocean-driven multi-year events whose dynamics are currently incompletely characterized by physical models, and hence poorly predictable. This…

Atmospheric and Oceanic Physics · Physics 2023-08-09 Matthew Bonas, Christopher K. Wikle, Stefano Castruccio

There are two main approximations of mining big data in memory. One is to partition a big dataset to several subsets, so as to mine each subset in memory. By this way, global patterns can be obtained by synthesizing all local patterns…

Databases · Computer Science 2016-11-30 Shichao Zhang

Data assimilation algorithms integrate prior information from numerical model simulations with observed data. Ensemble-based filters, regarded as state-of-the-art, are widely employed for large-scale estimation tasks in disciplines such as…

Numerical Analysis · Mathematics 2024-05-24 Iris Rammelmüller, Gottfried Hastermann, Jana de Wiljes

Data-driven algorithm selection is a powerful approach for choosing effective heuristics for computational problems. It operates by evaluating a set of candidate algorithms on a collection of representative training instances and selecting…

Machine Learning · Computer Science 2025-12-04 Vaggos Chatziafratis, Ishani Karmarkar, Yingxi Li, Ellen Vitercik

With the recent rise of generative Artificial Intelligence (AI), the need of selecting high-quality dataset to improve machine learning models has garnered increasing attention. However, some part of this topic remains underexplored, even…

Machine Learning · Statistics 2025-06-16 Kyung Rok Kim, Yansong Wang, Xiaocheng Li, Guanting Chen

We consider algorithmic problems in the setting in which the input data has been partitioned arbitrarily on many servers. The goal is to compute a function of all the data, and the bottleneck is the communication used by the algorithm. We…

Data Structures and Algorithms · Computer Science 2014-07-01 Ravindran Kannan, Santosh Vempala, David Woodruff

We consider stochastic settings for clustering, and develop provably-good approximation algorithms for a number of these notions. These algorithms yield better approximation ratios compared to the usual deterministic clustering setting.…

Data Structures and Algorithms · Computer Science 2023-10-13 David G. Harris, Shi Li, Thomas Pensyl, Aravind Srinivasan, Khoa Trinh

We propose a simple and efficient clustering method for high-dimensional data with a large number of clusters. Our algorithm achieves high-performance by evaluating distances of datapoints with a subset of the cluster centres. Our…

Machine Learning · Computer Science 2022-03-30 Georgios Exarchakis, Omar Oubari, Gregor Lenz

The size of large, geo-located datasets has reached scales where visualization of all data points is inefficient. Random sampling is a method to reduce the size of a dataset, yet it can introduce unwanted errors. We describe a method for…

Human-Computer Interaction · Computer Science 2017-09-14 Yan Zheng, Yi Ou, Alexander Lex, Jeff M. Phillips

The past century has seen a steady increase in the need of estimating and predicting complex systems and making (possibly critical) decisions with limited information. Although computers have made possible the numerical evaluation of…

Statistics Theory · Mathematics 2017-01-13 Houman Owhadi, Clint Scovel

The availability of reliable, high-resolution climate and weather data is important to inform long-term decisions on climate adaptation and mitigation and to guide rapid responses to extreme events. Forecasting models are limited by…

We propose a multiscale approach for predicting quantities in dynamical systems which is explicitly structured to extract information in both fine-to-coarse and coarse-to-fine directions. We envision this method being generally applicable…

Atmospheric and Oceanic Physics · Physics 2025-12-30 Karl Otness, Laure Zanna, Joan Bruna