Related papers: Data analysis recipes: Choosing the binning for a …
Histograms are convenient non-parametric density estimators, which continue to be used ubiquitously. Summary quantities estimated from histogram-based probability density models depend on the choice of the number of bins. We introduce a…
Rank histograms are popular tools for assessing the reliability of meteorological ensemble forecast systems. A reliable forecast system leads to a uniform rank histogram, and deviations from uniformity can indicate miscalibrations. However,…
The histogram is an analysis tool in widespread use within many sciences, with high energy physics as a prime example. However, there exists an inherent bias in the choice of binning for the histogram, with different choices potentially…
When one deals with data drawn from continuous variables, a histogram is often inadequate to display their probability density. It deals inefficiently with statistical noise, and binsizes are free parameters. In contrast to that, the…
When reading peer-reviewed scientific literature describing any analysis of empirical data, it is natural and correct to proceed with the underlying assumption that experiments have made good faith efforts to ensure that their analyses…
The histogram is widely used as a simple, exploratory display of data, but it is usually not clear how to choose the number and size of bins. We construct a confidence set of distribution functions that optimally address the two main tasks…
Symbolic data analysis has been proposed as a technique for summarising large and complex datasets into a much smaller and tractable number of distributions -- such as random rectangles or histograms -- each describing a portion of the…
This paper presents a quantitative user study to evaluate how well users can visually perceive the underlying data distribution from a histogram representation. We used different sample and bin sizes and four different distributions…
Binning is applied to categorize data values or to see distributions of data. Existing binning algorithms often rely on statistical properties of data. However, there are semantic considerations for selecting appropriate binning schemes.…
Modern statistical analysis often encounters datasets with large sizes. For these datasets, conventional estimation methods can hardly be used immediately because practitioners often suffer from limited computational resources. In most…
The importance of finding the characteristics leading to either a success or a failure is one of the driving forces of data mining. The various application areas of finding success/failure factors cover vast variety of areas such as credit…
This article proposes a way to improve the presentation of histograms where data are compared to expectation. Sometimes, it is difficult to judge by eye whether the difference between the bin content and the theoretical expectation…
We go through the many considerations involved in fitting a model to data, using as an example the fit of a straight line to a set of points in a two-dimensional plane. Standard weighted least-squares fitting is only appropriate when there…
The simplicity and expressiveness of a histogram render it a useful feature in different contexts including deep learning. Although the process of computing a histogram is non-differentiable, researchers have proposed differentiable…
The histogram is a key method for visualizing data and estimating the underlying probability distribution. Incorrect conclusions about the data result from over or under-binning. A new method based on the Shannon entropy of the histogram…
Conventional statistics begins with a model, and assigns a likelihood of obtaining any particular set of data. The opposite approach, beginning with the data and assigning a likelihood to any particular model, is explored here for the case…
Accurate calibration of probabilistic predictive models learned is critical for many practical prediction and decision-making tasks. There are two main categories of methods for building calibrated classifiers. One approach is to develop…
Algorithmic statistics considers the following problem: given a binary string $x$ (e.g., some experimental data), find a "good" explanation of this data. It uses algorithmic information theory to define formally what is a good explanation.…
Predictions are often probabilities; e.g., a prediction could be for precipitation tomorrow, but with only a 30% chance. Given such probabilistic predictions together with the actual outcomes, "reliability diagrams" help detect and diagnose…
Context. Visualization of 2D distributions is an essential task, commonly done with a 2D histogram. The histogram is built by subdividing the sample space into regions and color-coding the number of samples in each region. Aims. We aim to…