
Computing Data Distribution from Query Selectivities

Data Structures and Algorithms 2024-01-12 v1 Databases

Abstract

We are given a set $\mathcal{Z}=\{(R_1,s_1),\ldots,(R_n,s_n)\}$, where each $R_i$ is a \emph{range} in $\Re^d$, such as a rectangle or a ball, and $s_i \in [0,1]$ denotes its \emph{selectivity}. The goal is to compute a small-size \emph{discrete data distribution} $\mathcal{D}=\{(q_1,w_1),\ldots,(q_m,w_m)\}$, where $q_j\in \Re^d$ and $w_j\in [0,1]$ for each $1\leq j\leq m$, and $\sum_{1\leq j\leq m}w_j=1$, such that $\mathcal{D}$ is the most \emph{consistent} with $\mathcal{Z}$, i.e., $\mathrm{err}_p(\mathcal{D},\mathcal{Z})=\frac{1}{n}\sum_{i=1}^n \lvert s_i-\sum_{j=1}^m w_j\cdot 1(q_j\in R_i)\rvert^p$ is minimized. In a database setting, $\mathcal{Z}$ corresponds to a workload of range queries over some table, together with their observed selectivities (i.e., the fraction of tuples returned), and $\mathcal{D}$ can be used as a compact model for approximating the data distribution within the table without accessing the underlying contents. In this paper, we obtain both upper and lower bounds for this problem. In particular, we show that the problem of finding the best data distribution from selectivity queries is $\mathsf{NP}$-complete. On the positive side, we describe a Monte Carlo algorithm that constructs, in time $O((n+\delta^{-d})\delta^{-2}\mathop{\mathrm{polylog}})$, a discrete distribution $\tilde{\mathcal{D}}$ of size $O(\delta^{-2})$, such that $\mathrm{err}_p(\tilde{\mathcal{D}},\mathcal{Z})\leq \min_{\mathcal{D}}\mathrm{err}_p(\mathcal{D},\mathcal{Z})+\delta$ (for $p=1,2,\infty$), where the minimum is taken over all discrete distributions. We also establish conditional lower bounds, which strongly indicate that relative approximations are infeasible and that the exponential dependency on the dimension cannot be removed for additive approximations. This suggests that significant improvements to our algorithm are unlikely.
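To make the error measure concrete, the following is a minimal Python sketch that evaluates $\mathrm{err}_p(\mathcal{D},\mathcal{Z})$ directly from its definition. It is an illustration only, not the algorithm of the paper. It assumes that ranges are closed axis-aligned rectangles and that $p$ is finite; the names `in_rect` and `err_p` and the toy workload are hypothetical.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # (lower corner, upper corner) of a closed box

def in_rect(q: Point, rect: Rect) -> bool:
    """Indicator 1(q in R) for a closed axis-aligned rectangle (assumed range type)."""
    lo, hi = rect
    return all(l <= x <= h for l, x, h in zip(lo, q, hi))

def err_p(dist: Sequence[Tuple[Point, float]],
          workload: Sequence[Tuple[Rect, float]],
          p: float = 1) -> float:
    """Average of |s_i - sum_j w_j * 1(q_j in R_i)|^p over the workload, finite p only."""
    total = 0.0
    for rect, s in workload:
        # Selectivity that the discrete distribution predicts for this range.
        estimate = sum(w for q, w in dist if in_rect(q, rect))
        total += abs(s - estimate) ** p
    return total / len(workload)

# Toy workload: two rectangles in the plane with their observed selectivities.
workload = [(((0.0, 0.0), (1.0, 1.0)), 0.5),
            (((0.0, 0.0), (2.0, 2.0)), 1.0)]

# Two candidate distributions, each a list of (point, weight) pairs with weights summing to 1.
two_points = [((0.5, 0.5), 0.5), ((1.5, 1.5), 0.5)]
one_point = [((0.5, 0.5), 1.0)]
```

In this toy instance, `two_points` matches both selectivities exactly, while `one_point` overestimates the first query by 0.5, so its error is positive for every finite $p$.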

Cite

@article{arxiv.2401.06047,
  title  = {Computing Data Distribution from Query Selectivities},
  author = {Pankaj K. Agarwal and Rahul Raychaudhury and Stavros Sintos and Jun Yang},
  journal= {arXiv preprint arXiv:2401.06047},
  year   = {2024}
}