Related papers: Histograms and Wavelets on Probabilistic Data
MapReduce is becoming the de facto framework for storing and processing massive data, due to its excellent scalability, reliability, and elasticity. In many MapReduce applications, obtaining a compact accurate summary of data is essential.…
The exponential growth of data in current times and the demand to gain information and knowledge from the data present new challenges for database researchers. Known database systems and algorithms are no longer capable of effectively…
In this paper we consider the wavelet synopsis construction problem without the restriction that we only choose a subset of coefficients of the original data. We provide the first near optimal algorithm. We arrive at the above algorithm by…
Existing decision-theoretic reasoning frameworks such as decision networks use simple data structures and processes. However, decisions are often made based on complex data structures, such as social networks and protein sequences, and rich…
Histograms are convenient non-parametric density estimators, which continue to be used ubiquitously. Summary quantities estimated from histogram-based probability density models depend on the choice of the number of bins. We introduce a…
The volume of data and the velocity with which it is being generated by com- putational experiments on high performance computing (HPC) systems is quickly outpacing our ability to effectively store this information in its full fidelity.…
Persistent homology is a central methodology in topological data analysis that has been successfully implemented in many fields and is becoming increasingly popular and relevant. The output of persistent homology is a persistence diagram --…
One of the fundamental problems in machine learning is the estimation of a probability distribution from data. Many techniques have been proposed to study the structure of data, most often building around the assumption that observations…
In view of the paradigm shift that makes science ever more data-driven, in this thesis we propose a synthesis method for encoding and managing large-scale deterministic scientific hypotheses as uncertain and probabilistic data. In the form…
Probabilistic relational models provide a well-established formalism to combine first-order logic and probabilistic models, thereby allowing to represent relationships between objects in a relational domain. At the same time, the field of…
In this article we propose a method of performing arithmetic operations on varia-bles with unknown distribution. The approach to the evaluation results of arithme-tic operations can select probability intervals of the algebraic equations…
Without unrealistic continuity and smoothness assumptions on a distributional density of one dimensional dataset, constructing an authentic possibly-gapped histogram becomes rather complex. The candidate ensemble is described via a…
The histogram is an analysis tool in widespread use within many sciences, with high energy physics as a prime example. However, there exists an inherent bias in the choice of binning for the histogram, with different choices potentially…
Data analysis in high-dimensional spaces aims at obtaining a synthetic description of a data set, revealing its main structure and its salient features. We here introduce an approach providing this description in the form of a topography of…
Increasing amounts of available data have led to a heightened need for representing large-scale probabilistic knowledge bases. One approach is to use a probabilistic database, a model with strong assumptions that allow for efficiently…
Databases are widespread, yet extracting relevant data can be difficult. Without substantial domain knowledge, multivariate search queries often return sparse or uninformative results. This paper introduces an approach for searching…
Probabilistic Answer Set Programming under the credal semantics (PASP) extends Answer Set Programming with probabilistic facts that represent uncertain information. The probabilistic facts are discrete with Bernoulli distributions. However,…
Symbolic data analysis has been proposed as a technique for summarising large and complex datasets into a much smaller and tractable number of distributions -- such as random rectangles or histograms -- each describing a portion of the…
Existence of incomplete and imprecise data has moved the database paradigm from deterministic to proba- babilistic information. Probabilistic databases contain tuples that may or may not exist with some probability. As a result, the number…
Biclustering is an unsupervised machine-learning approach aiming to cluster rows and columns simultaneously in a data matrix. Several biclustering algorithms have been proposed for handling numeric datasets. However, real-world data mining…