Related papers: Estimation with Binned Data
Researchers must often estimate income inequality using data that give only the number of cases (e.g., families or households) whose incomes fall in "bins" such as $0-9,999, $10,000-14,999,..., $200,000+. We find that popular methods for…
When reading peer-reviewed scientific literature describing any analysis of empirical data, it is natural and correct to proceed with the underlying assumption that experimenters have made good faith efforts to ensure that their analyses…
Power-law probability distributions arise often in the social and natural sciences. Statistics have been developed for estimating the exponent parameter as well as gauging goodness-of-fit to a power law. Yet paradoxically, many famous power…
Researchers often estimate income statistics from summaries that report the number of incomes in bins such as $0-10,000, $10,001-20,000, ..., $200,000+. Some analysts assign incomes to bin midpoints, but this treats income as discrete.…
This paper introduces a new type of probabilistic semiparametric model that takes advantage of data binning to reduce the computational cost of kernel density estimation in nonparametric distributions. Two new conditional probability…
For a larger set of predictions of several differently trained machine learning models, known as bagging predictors, the mean of all predictions is taken by default. Nevertheless, this procedure can deviate from the actual ground truth in…
The mathematical properties of a family of generalized beta distributions, including the beta-normal, skewed-t, log-F, beta-exponential, and beta-Weibull distributions, have recently been studied in several publications. This paper applies these…
We propose a new approach to mixed-frequency regressions in a high-dimensional environment that resorts to Group Lasso penalization and Bayesian techniques for estimation and inference. In particular, to improve the prediction properties of…
While the expected calibration error (ECE), which employs binning, is widely adopted to evaluate the calibration performance of machine learning models, theoretical understanding of its estimation bias is limited. In this paper, we present…
In numerous instances, the generalized exponential distribution can be used as an alternative to the most widely used non-regular family of distributions: Weibull, gamma, lognormal with three parameters when analyzing lifetime or any skewed…
Maximum likelihood fits to data can be performed using binned data and unbinned data. The likelihood fits in either case produce only the fitted quantities but not the goodness of fit. With binned data, one can obtain a measure of the…
Many applications involve estimating the mean of multiple binomial outcomes as a common problem -- assessing intergenerational mobility of census tracts, estimating prevalence of infectious diseases across countries, and measuring…
For an AI system to be reliable, the confidence it expresses in its decisions must match its accuracy. To assess the degree of match, examples are typically binned by confidence and the per-bin mean confidence and accuracy are compared.…
Food security is more prominent on the policy agenda today than it has been in the past, thanks to recent food shortages at both the regional and global levels as well as renewed promises from major donor countries to combat chronic hunger.…
When randomized ensembles such as bagging or random forests are used for binary classification, the prediction error of the ensemble tends to decrease and stabilize as the number of classifiers increases. However, the precise relationship…
We analyze the data on personal income distribution from the Australian Bureau of Statistics. We compare fits of the data to the exponential, log-normal, and gamma distributions. The exponential function gives a good (albeit not perfect)…
Many man-made and natural phenomena, including the intensity of earthquakes, population of cities and size of international wars, are believed to follow power-law distributions. The accurate identification of power-law patterns has…
Distributed Lag Models (DLMs) and similar regression approaches such as MIDAS have been used for many decades in econometrics and more recently to investigate how poor air quality adversely affects human health. In this paper we describe…
In environmental studies, many data are typically skewed and it is desired to have a flexible statistical model for this kind of data. In this paper, we study a class of skewed distributions by invoking arguments as described by Ferreira…
The theory of Local Intrinsic Dimensionality (LID) has become a valuable tool for characterizing local complexity within and across data manifolds, supporting a range of data mining and machine learning tasks. Accurate LID estimation…