Compressed Multiple Pattern Matching

Data Structures and Algorithms 2019-04-02 v2

Abstract

Given $d$ strings over the alphabet $\{0,1,\ldots,\sigma{-}1\}$, the classical Aho--Corasick data structure allows us to find all $occ$ occurrences of the strings in any text $T$ in $O(|T| + occ)$ time using $O(m\log m)$ bits of space, where $m$ is the number of edges in the trie containing the strings. Fix any constant $\varepsilon \in (0, 2)$. We describe a compressed solution for the problem that, provided $\sigma \le m^\delta$ for a constant $\delta < 1$, works in $O(|T| \frac{1}{\varepsilon} \log\frac{1}{\varepsilon} + occ)$ time, which is $O(|T| + occ)$ since $\varepsilon$ is constant, and occupies $mH_k + 1.443 m + \varepsilon m + O(d\log\frac{m}{d})$ bits of space, for all $0 \le k \le \max\{0,\alpha\log_\sigma m - 2\}$ simultaneously, where $\alpha \in (0,1)$ is an arbitrary constant and $H_k$ is the $k$th-order empirical entropy of the trie. Hence, we reduce the $3.443m$ term in the space bounds of the previously best succinct solutions to $(1.443 + \varepsilon)m$, thus solving an open problem posed by Belazzougui. Further, we observe that $L = \log\binom{\sigma (m+1)}{m} - O(\log(\sigma m))$ is a worst-case space lower bound for any solution of the problem and that, for $d = o(m)$ and constant $\varepsilon$, our approach achieves $L + \varepsilon m$ bits of space. This gives evidence that, for $d = o(m)$, the space of our data structure is theoretically optimal up to the $\varepsilon m$ additive term and that the $1.443m$ term can hardly be eliminated. In addition, we refine the space analysis of previous works by proposing a more appropriate definition for $H_k$.
We also simplify the construction for practical use by adapting the fixed block compression boosting technique, implement our data structure, and conduct a number of experiments showing that it is comparable to the state of the art in time and superior in space.
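
For orientation, the classical uncompressed Aho--Corasick baseline that the abstract compares against can be sketched as follows. This is a minimal illustration only, not the compressed data structure of the paper: the names `build_automaton` and `find_all` are hypothetical, patterns are assumed to be non-empty Python strings rather than sequences over $\{0,1,\ldots,\sigma{-}1\}$, and the per-node dictionaries and copied output lists take far more space than the bounds discussed above.

```python
from collections import deque


def build_automaton(patterns):
    """Build a plain (uncompressed) Aho-Corasick automaton over a trie."""
    goto = [{}]   # goto[v][c] = child of node v along symbol c
    fail = [0]    # failure link of each node (root is node 0)
    out = [[]]    # indices of patterns that end at each node
    for idx, pat in enumerate(patterns):
        v = 0
        for c in pat:
            if c not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[v][c] = len(goto) - 1
            v = goto[v][c]
        out[v].append(idx)
    # Breadth-first traversal: failure links of shallower nodes are
    # final before deeper nodes are processed.
    queue = deque(goto[0].values())
    while queue:
        v = queue.popleft()
        for c, u in goto[v].items():
            f = fail[v]
            while f and c not in goto[f]:
                f = fail[f]
            fail[u] = goto[f].get(c, 0)
            # Inherit patterns that end at the failure target.
            out[u] = out[u] + out[fail[u]]
            queue.append(u)
    return goto, fail, out


def find_all(patterns, text):
    """Return (start_position, pattern_index) for every occurrence."""
    goto, fail, out = build_automaton(patterns)
    v = 0
    hits = []
    for i, c in enumerate(text):
        # Follow failure links until symbol c can be read (or we hit the root).
        while v and c not in goto[v]:
            v = fail[v]
        v = goto[v].get(c, 0)
        for idx in out[v]:
            hits.append((i - len(patterns[idx]) + 1, idx))
    return hits
```

In this sketch the number of trie edges $m$ equals `len(goto) - 1`. For example, `find_all(["he", "she", "his", "hers"], "ushers")` reports "she" at position 1 and both "he" and "hers" at position 2.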

Cite

@article{arxiv.1811.01248,
  title  = {Compressed Multiple Pattern Matching},
  author = {Dmitry Kosolobov and Nikita Sivukhin},
  journal= {arXiv preprint arXiv:1811.01248},
  year   = {2019}
}

Comments

14 pages, 3 figures, 1 table