Hidden Words Statistics for Large Patterns

Svante Janson; Wojciech Szpankowski

Probability 2020-03-24 v1 Data Structures and Algorithms

Abstract

We study the so-called subsequence pattern matching problem, also known as hidden pattern matching, in which one searches for a given pattern $w$ of length $m$ as a subsequence in a random text of length $n$. The quantity of interest is the number of occurrences of $w$ as a subsequence (i.e., occurring in not necessarily consecutive text locations). This problem finds many applications, from intrusion detection to trace reconstruction, deletion channels, and DNA-based storage systems. In all of these applications, the pattern $w$ is of variable length. To the best of our knowledge, this problem has been tackled only for a fixed length $m=O(1)$ [Flajolet, Szpankowski and Vallée, 2006]. In our main result we prove that for $m=o(n^{1/3})$ the number of subsequence occurrences is asymptotically normally distributed. In addition, we show that under some constraints on the structure of $w$ the asymptotic normality can be extended to $m=o(\sqrt{n})$. For a special pattern $w$ consisting of a single repeated symbol, we indicate that for $m=o(n)$ the distribution of the number of subsequence occurrences is either asymptotically normal or asymptotically log-normal. We conjecture that this dichotomy holds for all patterns. We use Hoeffding's projection method for $U$-statistics to prove our findings.
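To make the counted quantity concrete, here is a minimal Python sketch of how the number of occurrences of $w$ as a subsequence could be computed for one given text. It is an illustration only and not the authors' method: the paper's results are probabilistic, whereas this is a standard dynamic program. The function names `count_subsequence_occurrences` and `random_text`, and the choice of a text with independent, uniformly distributed symbols, are assumptions made for this example.

```python
import random


def count_subsequence_occurrences(pattern: str, text: str) -> int:
    """Count the ways `pattern` occurs in `text` as a (not necessarily
    contiguous) subsequence, using O(len(pattern)) memory."""
    m = len(pattern)
    # dp[j] = number of ways to match pattern[:j] inside the text read so far.
    dp = [0] * (m + 1)
    dp[0] = 1  # the empty prefix matches in exactly one way
    for ch in text:
        # Go right to left so each text symbol extends a match at most once.
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[m]


def random_text(n: int, alphabet: str = "ab", seed=None) -> str:
    """Hypothetical helper: n independent, uniformly chosen symbols
    (an assumed text model, used here only for illustration)."""
    rng = random.Random(seed)
    return "".join(rng.choice(alphabet) for _ in range(n))


if __name__ == "__main__":
    text = random_text(50, seed=1)
    print(count_subsequence_occurrences("abba", text))
```

For a pattern made of one repeated symbol, the count reduces to the binomial coefficient $\binom{N}{m}$, where $N$ is the number of times that symbol appears in the text; for example, the pattern `aa` occurs 6 times in the text `aaaa`.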

Keywords

Sturmian words and combinatorics on words; Collatz conjecture; integer sequences and recurrences

Cite

@article{arxiv.2003.09584,
  title  = {Hidden Words Statistics for Large Patterns},
  author = {Svante Janson and Wojciech Szpankowski},
  journal= {arXiv preprint arXiv:2003.09584},
  year   = {2020}
}

Comments

22 pages
