Efficient Computation of Sequence Mappability

Data Structures and Algorithms 2021-06-18 v3

Abstract

In the $(k,m)$-mappability problem, for a given sequence $T$ of length $n$, the goal is to compute a table whose $i$th entry is the number of indices $j \ne i$ such that the length-$m$ substrings of $T$ starting at positions $i$ and $j$ have at most $k$ mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for $k=\mathcal{O}(1)$, works in $\mathcal{O}(n)$ space and, with high probability, in $\mathcal{O}(n \cdot \min\{m^k,\log^k n\})$ time. Our algorithm requires a careful adaptation of the $k$-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop $\mathcal{O}(n^2)$-time algorithms to compute all $(k,m)$-mappability tables for a fixed $m$ and all $k\in \{0,\ldots,m\}$ or a fixed $k$ and all $m\in\{k,\ldots,n\}$. Finally, we show that, for $k,m = \Theta(\log n)$, the $(k,m)$-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper that was presented at SPIRE 2018.
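
To make the problem definition concrete, the following is a minimal brute-force sketch that computes the table directly from the definition. It is not one of the algorithms of the paper: it runs in O(n^2 m) time and serves only as a reference baseline. The function name `mappability_naive`, the 0-based indexing, and the choice to report entries only for the n - m + 1 positions where a length-m substring exists are assumptions made for illustration.

```python
from typing import List


def mappability_naive(text: str, m: int, k: int) -> List[int]:
    """Brute-force (k, m)-mappability table (illustrative baseline only).

    Entry i counts the positions j != i such that text[i:i+m] and
    text[j:j+m] differ in at most k positions (Hamming distance <= k).
    Positions are 0-based here, which is an assumption of this sketch.
    """
    num_windows = len(text) - m + 1
    if m <= 0 or num_windows <= 0:
        return []

    counts = [0] * num_windows
    for i in range(num_windows):
        # Hamming distance is symmetric, so each unordered pair is
        # examined once and credited to both positions.
        for j in range(i + 1, num_windows):
            mismatches = 0
            for t in range(m):
                if text[i + t] != text[j + t]:
                    mismatches += 1
                    if mismatches > k:
                        break  # early exit: this pair cannot qualify
            if mismatches <= k:
                counts[i] += 1
                counts[j] += 1
    return counts


if __name__ == "__main__":
    # Length-2 substrings of "abcab" are "ab", "bc", "ca", "ab".
    print(mappability_naive("abcab", m=2, k=1))
```

Processing each unordered pair once is also what avoids counting a pair of substrings twice in this baseline. A table produced this way can be used to cross-check a faster implementation on small inputs.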

Cite

@article{arxiv.1807.11702,
  title  = {Efficient Computation of Sequence Mappability},
  author = {Panagiotis Charalampopoulos and Costas S. Iliopoulos and Tomasz Kociumaka and Solon P. Pissis and Jakub Radoszewski and Juliusz Straszyński},
  journal= {arXiv preprint arXiv:1807.11702},
  year   = {2021}
}

Comments

Accepted to SPIRE 2018