Related papers: Non-uniform quantization with linear average-case …
Conformal prediction constructs a set of labels instead of a single point prediction, while providing a probabilistic coverage guarantee. Beyond the coverage guarantee, adaptiveness to example difficulty is an important property. It means…
In the Bin Packing problem one is given $n$ items with weights $w_1,\ldots,w_n$ and $m$ bins with capacities $c_1,\ldots,c_m$. The goal is to find a partition of the items into sets $S_1,\ldots,S_m$ such that $w(S_j) \leq c_j$ for every bin…
Local moments are used for local regression, to compute statistical measures such as sums, averages, and standard deviations, and to approximate probability distributions. We consider the case where the data source is a very large I/O array…
Binned scatter plots are a powerful statistical tool for empirical work in the social, behavioral, and biomedical sciences. Available methods rely on a quantile-based partitioning estimator of the conditional mean regression function to…
This paper describes a new median algorithm and a median approximation algorithm. The former has O(n) average running time and the latter has O(n) worst-case running time. These algorithms are highly competitive with the standard algorithm…
Here we present a novel approach to statistical analysis of financial time series. The approach is based on $n$-grams frequency dictionaries derived from the quantized market data. Such dictionaries are studied by evaluating their…
Binning (a.k.a. discretization) of numerically continuous measurements is a widespread but controversial practice in data collection, analysis, and presentation. The consequences of binning have been evaluated for many different kinds of…
In binary classification, there are situations where negative (N) data are too diverse to be fully labeled and we often resort to positive-unlabeled (PU) learning in these scenarios. However, collecting a non-representative N set that…
This work proposes a non-iterative strategy for missing value imputations which is guided by similarity between observations, but instead of explicitly determining distances or nearest neighbors, it assigns observations to overlapping…
When reading peer-reviewed scientific literature describing any analysis of empirical data, it is natural and correct to proceed with the underlying assumption that experimenters have made good-faith efforts to ensure that their analyses…
Data discretization, also known as binning, is a frequently used technique in computer science, statistics, and their applications to biological data analysis. We present a new method for the discretization of real-valued data into a finite…
The method of random projections has become very popular for large-scale applications in statistical learning, information retrieval, bio-informatics and other applications. Using a well-designed coding scheme for the projected data, which…
Reliable density estimation is fundamental for numerous applications in statistics and machine learning. In many practical scenarios, data are best modeled as mixtures of component densities that capture complex and multimodal patterns.…
Despite significant progress in the caching literature concerning the worst case and uniform average case regimes, the algorithms for caching with nonuniform demands are still at a basic stage and mostly rely on simple grouping and…
The bin packing problem is to find the minimum number of bins of size one to pack a list of items with sizes $a_1,\ldots,a_n$ in $(0,1]$. Using uniform sampling, which selects a random element from the input list each time, we develop a…
This paper introduces a new type of probabilistic semiparametric model that takes advantage of data binning to reduce the computational cost of kernel density estimation in nonparametric distributions. Two new conditional probability…
We study memory-efficient matrix factorization for differentially private counting under continual observation. While recent work by Henzinger and Upadhyay (2024) introduced a factorization method with reduced error based on group algebra,…
Mutual Information (MI) is a powerful statistical measure that quantifies shared information between random variables, particularly valuable in high-dimensional data analysis across fields like genomics, natural language processing, and…
We introduce a new kind of likelihood function based on the sequence of moments of the data distribution. Both binned and unbinned data samples are discussed, and the multivariate case is also derived. Building on this approach we lay out…
Random binning is an efficient, yet complex, coding technique for the symmetric L-description source coding problem. We propose an alternative approach, that uses the quantized samples of a bandlimited source as "descriptions". By the…