Substring Complexity in Sublinear Space

Data Structures and Algorithms 2023-11-16 v2

Abstract

Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size $z$ of the Lempel-Ziv parse or the number $r$ of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size $\gamma$ of a smallest string attractor. Let $T$ be a string of length $n$. A string attractor of $T$ is a set of positions of $T$ capturing the occurrences of all the substrings of $T$. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing $\gamma$ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function $S_T(k)$ counting the number of distinct substrings of length $k$ of $T$, also known as the substring complexity of $T$. This new measure is defined as $\delta = \sup\{S_T(k)/k, k \geq 1\}$ and lower bounds all the relevant ad hoc measures previously considered. In particular, $\delta \leq \gamma$ always holds and $\delta$ can be computed in $\mathcal{O}(n)$ time using $\Theta(n)$ working space. Kociumaka et al. showed that one can construct an $\mathcal{O}(\delta \log \frac{n}{\delta})$-sized representation of $T$ supporting efficient direct access and efficient pattern matching queries on $T$. Given that for highly compressible strings, $\delta$ is significantly smaller than $n$, it is natural to pose the following question: Can we compute $\delta$ efficiently using sublinear working space? We address this algorithmic challenge by showing the following bounds to compute $\delta$: $\mathcal{O}(\frac{n^3 \log b}{b^2})$ time using $\mathcal{O}(b)$ space, for any $b \in [1,n]$, in the comparison model; or $\tilde{\mathcal{O}}(n^2/b)$ time using $\tilde{\mathcal{O}}(b)$ space, for any $b \in [\sqrt{n},n]$, in the word RAM model.
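To make the definition of $\delta$ concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm of the paper, nor the linear-time method mentioned above: it simply counts distinct substrings with a hash set for every length $k$, which takes cubic time in the worst case and more than linear space. The function names `substring_complexity` and `delta` are illustrative choices, not taken from the source. Since $S_T(k) = 0$ for $k > n$, the supremum is attained for some $k \in [1, n]$, so a maximum over that range suffices.

```python
from fractions import Fraction


def substring_complexity(text: str, k: int) -> int:
    """Return S_T(k): the number of distinct length-k substrings of text."""
    return len({text[i:i + k] for i in range(len(text) - k + 1)})


def delta(text: str) -> Fraction:
    """Return max over k in [1, n] of S_T(k) / k, as an exact fraction.

    Brute force for illustration only; returns 0 for the empty string.
    """
    n = len(text)
    return max(
        (Fraction(substring_complexity(text, k), k) for k in range(1, n + 1)),
        default=Fraction(0),
    )
```

For example, for $T = \texttt{abab}$ we have $S_T(1) = 2$, $S_T(2) = 2$, $S_T(3) = 2$ and $S_T(4) = 1$, so `delta("abab")` evaluates to $2$, attained at $k = 1$.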

Cite

@article{arxiv.2007.08357,
  title  = {Substring Complexity in Sublinear Space},
  author = {Giulia Bernardini and Gabriele Fici and Paweł Gawrychowski and Solon P. Pissis},
  journal= {arXiv preprint arXiv:2007.08357},
  year   = {2023}
}

Comments

Accepted to ISAAC 2023. Abstract abridged to satisfy arXiv requirements.