
Hidden Words Statistics for Large Patterns

Probability 2020-03-24 v1 Data Structures and Algorithms

Abstract

We study the so-called subsequence pattern matching problem, also known as hidden pattern matching, in which one searches for a given pattern $w$ of length $m$ as a subsequence in a random text of length $n$. The quantity of interest is the number of occurrences of $w$ as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem finds many applications, from intrusion detection to trace reconstruction, deletion channels, and DNA-based storage systems. In all of these applications, the pattern $w$ is of variable length. To the best of our knowledge, this problem has only been tackled for a fixed length $m=O(1)$ [Flajolet, Szpankowski and Vallée, 2006]. In our main result we prove that for $m=o(n^{1/3})$ the number of subsequence occurrences is normally distributed. In addition, we show that under some constraints on the structure of $w$ the asymptotic normality can be extended to $m=o(\sqrt{n})$. For a special pattern $w$ consisting of the same symbol, we indicate that for $m=o(n)$ the distribution of the number of subsequences is either asymptotically normal or asymptotically log-normal. We conjecture that this dichotomy is true for all patterns. We use Hoeffding's projection method for $U$-statistics to prove our findings.
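To make the quantity of interest concrete, the following is a minimal sketch of how the number of subsequence occurrences of a pattern in a fixed text could be counted. It is an illustration only and is not taken from the paper: the function name `count_subsequence_occurrences` is hypothetical, and the approach is the standard dynamic program over pattern prefixes, not the authors' analytic method.

```python
def count_subsequence_occurrences(text: str, pattern: str) -> int:
    """Count the ways `pattern` occurs in `text` as a subsequence,
    i.e. at increasing but not necessarily consecutive positions.

    Hypothetical helper for illustration; not part of the paper.
    """
    m = len(pattern)
    # ways[j] = number of ways to match the first j pattern symbols
    # using the text symbols read so far.
    ways = [0] * (m + 1)
    ways[0] = 1  # the empty prefix is matched in exactly one way
    for ch in text:
        # Scan j from right to left so that a single text symbol is
        # never used twice within the same occurrence.
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[m]


if __name__ == "__main__":
    # "ab" occurs in "abab" at position pairs (1,2), (1,4), (3,4).
    print(count_subsequence_occurrences("abab", "ab"))
    # A pattern made of one repeated symbol: choose 2 of the 4 a's.
    print(count_subsequence_occurrences("aaaa", "aa"))
```

The sketch runs in time proportional to $nm$. For a pattern consisting of a single repeated symbol, the count reduces to a binomial coefficient in the number of copies of that symbol in the text, which is the special case singled out in the abstract.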

Cite

@article{arxiv.2003.09584,
  title  = {Hidden Words Statistics for Large Patterns},
  author = {Svante Janson and Wojciech Szpankowski},
  journal= {arXiv preprint arXiv:2003.09584},
  year   = {2020}
}

Comments

22 pages