Contextual Pattern Mining and Counting

Data Structures and Algorithms 2025-06-24 v1 Databases

Abstract

Given a string $P$ of length $m$, a longer string $T$ of length $n > m$, and two integers $l \geq 0$ and $r \geq 0$, the context of $P$ in $T$ is the set of all string pairs $(L, R)$, with $|L| = l$ and $|R| = r$, such that the string $LPR$ occurs in $T$. We introduce two problems related to the notion of context: (1) the Contextual Pattern Mining (CPM) problem, which, given $T$, $(m, l, r)$, and an integer $\tau > 0$, asks for outputting the context of each substring $P$ of length $m$ of $T$, provided that the size of the context of $P$ is at least $\tau$; and (2) the Contextual Pattern Counting (CPC) problem, which asks for preprocessing $T$ so that the size of the context of a given query string $P$ of length $m$ can be found efficiently. For CPM, we propose a linear-work algorithm that either uses only internal memory, or a bounded amount of internal memory and external memory, which allows much larger datasets to be handled. For CPC, we propose an $\widetilde{\mathcal{O}}(n)$-space index that can be constructed in $\widetilde{\mathcal{O}}(n)$ time and answers queries in $\mathcal{O}(m) + \widetilde{\mathcal{O}}(1)$ time. We further improve the practical performance of the CPC index by optimizations that exploit the LZ77 factorization of $T$ and an upper bound on the query length. Using billion-letter datasets from different domains, we show that the external memory version of our CPM algorithm can deal with very large datasets using a small amount of internal memory while its runtime is comparable to that of the internal memory version. Interestingly, we also show that our optimized index for CPC outperforms an approach based on the state of the art for the reporting version of CPC [Navarro, SPIRE 2020] in terms of query time, index size, construction time, and construction space, often by more than an order of magnitude.
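
To make the definition of context concrete, the following is a minimal brute-force sketch in Python. It is not the linear-work CPM algorithm or the CPC index proposed in the paper; the function names `context` and `naive_cpm` are hypothetical and chosen only for illustration. It simply enumerates occurrences and collects the flanking pairs (L, R).

```python
from collections import defaultdict


def context(T, P, l, r):
    """Brute-force context of P in T: all pairs (L, R) with |L| = l and
    |R| = r such that L + P + R occurs in T. Illustrative only."""
    m = len(P)
    pairs = set()
    start = T.find(P)
    while start != -1:
        # Keep the occurrence only if l letters fit to its left
        # and r letters fit to its right inside T.
        if start >= l and start + m + r <= len(T):
            pairs.add((T[start - l:start], T[start + m:start + m + r]))
        start = T.find(P, start + 1)  # allow overlapping occurrences
    return pairs


def naive_cpm(T, m, l, r, tau):
    """Naive CPM: map each length-m substring P of T to its context,
    keeping only those P whose context has size at least tau."""
    contexts = defaultdict(set)
    # Position i is the start of P; it needs l letters before and r after.
    for i in range(l, len(T) - m - r + 1):
        P = T[i:i + m]
        contexts[P].add((T[i - l:i], T[i + m:i + m + r]))
    return {P: c for P, c in contexts.items() if len(c) >= tau}


if __name__ == "__main__":
    T = "xabcyabdxabc"
    print(context(T, "ab", 1, 1))
    print(naive_cpm(T, 2, 1, 1, 2))
```

For T = "xabcyabdxabc", P = "ab", and l = r = 1, the pattern occurs three times but two occurrences share the same flanking letters, so the context has size 2: the pairs ("x", "c") and ("y", "d"). With τ = 2, "ab" is the only length-2 substring reported by the naive CPM sketch. The work of this sketch grows with the total length of the extracted substrings, which is what the paper's linear-work algorithm and index are designed to avoid.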

Cite

@article{arxiv.2506.17613,
  title  = {Contextual Pattern Mining and Counting},
  author = {Ling Li and Daniel Gibney and Sharma V. Thankachan and Solon P. Pissis and Grigorios Loukides},
  journal= {arXiv preprint arXiv:2506.17613},
  year   = {2025}
}

Comments

27 pages, 15 figures