Related papers: Plotting the Differences Between Data and Expectat…

Plots of the cumulative differences between observed and expected values of ordered Bernoulli variates

Many predictions are probabilistic in nature; for example, a prediction could be for precipitation tomorrow, but with only a 30 percent chance. Given both the predictions and the actual outcomes, "reliability diagrams" (also known as…

Methodology · Statistics 2020-07-20 Mark Tygert

Histogram binning revisited with a focus on human perception

This paper presents a quantitative user study to evaluate how well users can visually perceive the underlying data distribution from a histogram representation. We used different sample and bin sizes and four different distributions…

Human-Computer Interaction · Computer Science 2021-09-15 Raphael Sahann, Torsten Möller, Johanna Schmidt

The Essential Histogram

The histogram is widely used as a simple, exploratory display of data, but it is usually not clear how to choose the number and size of bins. We construct a confidence set of distribution functions that optimally address the two main tasks…

Statistics Theory · Mathematics 2020-02-13 Housen Li, Axel Munk, Hannes Sieling, Guenther Walther

Metrics of calibration for probabilistic predictions

Predictions are often probabilities; e.g., a prediction could be for precipitation tomorrow, but with only a 30% chance. Given such probabilistic predictions together with the actual outcomes, "reliability diagrams" help detect and diagnose…

Statistics Theory · Mathematics 2022-11-15 Imanol Arrieta-Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, Cherie Xu

Optimal Data-Based Binning for Histograms

Histograms are convenient non-parametric density estimators, which continue to be used ubiquitously. Summary quantities estimated from histogram-based probability density models depend on the choice of the number of bins. We introduce a…

Data Analysis, Statistics and Probability · Physics 2013-09-17 Kevin H. Knuth

A method for statistical comparison of histograms

We propose an approach for testing the hypothesis that two realizations of the random variables in the form of histograms are taken from the same statistical population (i.e. that two histograms are drawn from the same distribution). The…

Data Analysis, Statistics and Probability · Physics 2013-05-22 Sergey Bityukov, Nikolai Krasnikov, Alexander Nikitenko, Vera Smirnova

On the distinguishability of histograms

We consider an approach for testing the hypothesis that two realizations of the random variables in the form of histograms are taken from the same statistical population (i.e. two histograms are drawn from the same distribution). The…

Data Analysis, Statistics and Probability · Physics 2013-11-26 S. Bityukov, N. Krasnikov, A. Nikitenko, V. Smirnova

Data analysis recipes: Choosing the binning for a histogram

Data points are placed in bins when a histogram is created, but there is always a decision to be made about the number or width of the bins. This decision is often made arbitrarily or subjectively, but it need not be. A jackknife or…

Data Analysis, Statistics and Probability · Physics 2008-07-31 David W. Hogg

Cumulative deviation of a subpopulation from the full population

Assessing equity in treatment of a subpopulation often involves assigning numerical "scores" to all individuals in the full population such that similar individuals get similar scores; matching via propensity scores or appropriate…

Methodology · Statistics 2021-10-18 Mark Tygert

Resolving Histogram Binning Dilemmas with Binless and Binfull Algorithms

The histogram is an analysis tool in widespread use within many sciences, with high energy physics as a prime example. However, there exists an inherent bias in the choice of binning for the histogram, with different choices potentially…

Data Analysis, Statistics and Probability · Physics 2014-05-21 Abram Krislock, Nathan Krislock

Differentiable Histogram with Hard-Binning

The simplicity and expressiveness of a histogram render it a useful feature in different contexts including deep learning. Although the process of computing a histogram is non-differentiable, researchers have proposed differentiable…

Machine Learning · Computer Science 2020-12-14 Ibrahim Yusuf, George Igwegbe, Oluwafemi Azeez

A graphical method of cumulative differences between two subpopulations

Comparing the differences in outcomes (that is, in "dependent variables") between two subpopulations is often most informative when comparing outcomes only for individuals from the subpopulations who are similar according to "independent…

Methodology · Statistics 2021-12-20 Mark Tygert

Preserving Statistical Validity in Adaptive Data Analysis

A great deal of effort has been devoted to reducing the risk of spurious scientific discoveries, from the use of sophisticated validation techniques, to deep statistical methods for controlling the false discovery rate in multiple…

Machine Learning · Computer Science 2016-03-03 Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, Aaron Roth

Do We Really Even Need Data? A Modern Look at Drawing Inference with Predicted Data

As artificial intelligence and machine learning tools become more accessible, and scientists face new obstacles to data collection (e.g., rising costs, declining survey response rates), researchers increasingly use predictions from…

Machine Learning · Statistics 2025-12-08 Stephen Salerno, Kentaro Hoffman, Awan Afiaz, Anna Neufeld, Tyler H. McCormick, Jeffrey T. Leek

On the number of bins in a rank histogram

Rank histograms are popular tools for assessing the reliability of meteorological ensemble forecast systems. A reliable forecast system leads to a uniform rank histogram, and deviations from uniformity can indicate miscalibrations. However,…

Applications · Statistics 2022-09-30 Claudio Heinrich

"What is Different Between These Datasets?" A Framework for Explaining Data Distribution Shifts

The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two…

Machine Learning · Computer Science 2025-09-24 Varun Babbar, Zhicheng Guo, Cynthia Rudin

The Shannon Entropy of a Histogram

The histogram is a key method for visualizing data and estimating the underlying probability distribution. Incorrect conclusions about the data result from over- or under-binning. A new method based on the Shannon entropy of the histogram…

Data Analysis, Statistics and Probability · Physics 2022-10-07 Stephen Watts, Lisa Crow

Statistical computation of Boltzmann entropy and estimation of the optimal probability density function from statistical sample

In this work, we investigate the statistical computation of the Boltzmann entropy of statistical samples. For this purpose, we use both histograms and kernel functions to estimate the probability density function of statistical samples. We…

Methodology · Statistics 2015-06-23 Ning Sui, Min Li, Ping He

Histogram lies about distribution shape and Pearson's coefficient of variation lies about variability

Background and Objective: Histograms and Pearson's coefficient of variation are among the most popular tools for summarizing data. Researchers use histograms to judge the shape of quantitative data distribution by visual inspection. The coefficient…

Methodology · Statistics 2022-04-14 Paulo S. P. Silveira, Jose O. Siqueira

Data-driven nonlinear expectations for statistical uncertainty in decisions

In stochastic decision problems, one often wants to estimate the underlying probability measure statistically, and then to use this estimate as a basis for decisions. We shall consider how the uncertainty in this estimation can be…

Statistics Theory · Mathematics 2017-05-24 Samuel N. Cohen