Approximating Text-to-Pattern Hamming Distances

Data Structures and Algorithms 2020-01-03 v1

Abstract

We revisit a fundamental problem in string matching: given a pattern of length $m$ and a text of length $n$, both over an alphabet of size $\sigma$, compute the Hamming distance between the pattern and the text at every location. Several $(1+\epsilon)$-approximation algorithms have been proposed in the literature, with running time of the form $O(\epsilon^{-O(1)} n\log n\log m)$, all using fast Fourier transform (FFT). We describe a simple $(1+\epsilon)$-approximation algorithm that is faster and does not need FFT. Combining our approach with additional ideas leads to numerous new results:

- We obtain the first linear-time approximation algorithm; the running time is $O(\epsilon^{-2} n)$.
- We obtain a faster exact algorithm computing all Hamming distances up to a given threshold $k$; its running time improves previous results by logarithmic factors and is linear if $k\le\sqrt m$.
- We obtain approximation algorithms with better $\epsilon$-dependence using rectangular matrix multiplication. The time-bound is $\tilde O(n)$ when the pattern is sufficiently long: $m\ge \epsilon^{-28}$. Previous algorithms require $\tilde O(\epsilon^{-1} n)$ time.
- When $k$ is not too small, we obtain a truly sublinear-time algorithm to find all locations with Hamming distance approximately (up to a constant factor) less than $k$, in $O((n/k^{\Omega(1)}+\mathrm{occ})\,n^{o(1)})$ time, where $\mathrm{occ}$ is the output size. The algorithm leads to a property tester, returning true if an exact match exists and false if the Hamming distance is more than $\delta m$ at every location, running in $\tilde O(\delta^{-1/3} n^{2/3}+\delta^{-1} n/m)$ time.
- We obtain a streaming algorithm to report all locations with Hamming distance approximately less than $k$, using $\tilde O(\epsilon^{-2}\sqrt k)$ space. Previously, streaming algorithms were known for the exact problem with $\tilde O(k)$ space or for the approximate problem with $\tilde O(\epsilon^{-O(1)}\sqrt m)$ space.
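
To make the problem statement concrete, the following is a minimal sketch of the naive quadratic-time baseline. It is not any of the algorithms announced in the abstract and takes $O(nm)$ time in the worst case. The function names `hamming_distances` and `hamming_distances_up_to` are hypothetical, chosen only for illustration. The second function illustrates the output convention of the threshold variant (distances above $k$ are not reported) and is assumed here to return `None` for such locations.

```python
from typing import List, Optional


def hamming_distances(text: str, pattern: str) -> List[int]:
    """Exact Hamming distance between the pattern and every length-m
    window of the text, by direct comparison (O(n * m) time)."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    return [
        sum(1 for j in range(m) if text[i + j] != pattern[j])
        for i in range(n - m + 1)
    ]


def hamming_distances_up_to(text: str, pattern: str, k: int) -> List[Optional[int]]:
    """Threshold variant: report the exact distance where it is at most k,
    and None (an assumed placeholder) where it exceeds k."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    result: List[Optional[int]] = []
    for i in range(n - m + 1):
        mismatches = 0
        for j in range(m):
            if text[i + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:
                    break  # distance already exceeds the threshold
        result.append(mismatches if mismatches <= k else None)
    return result


if __name__ == "__main__":
    print(hamming_distances("abracadabra", "abra"))
    # → [0, 4, 3, 3, 3, 3, 4, 0]
    print(hamming_distances_up_to("abracadabra", "abra", 3))
    # → [0, None, 3, 3, 3, 3, None, 0]
```

A $(1+\epsilon)$-approximation algorithm only has to return, for each location, an estimate close to the value this baseline computes exactly; a baseline of this kind is mainly useful as a reference when testing faster implementations.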

Cite

@article{arxiv.2001.00211,
  title  = {Approximating Text-to-Pattern Hamming Distances},
  author = {Timothy M. Chan and Shay Golan and Tomasz Kociumaka and Tsvi Kopelowitz and Ely Porat},
  journal= {arXiv preprint arXiv:2001.00211},
  year   = {2020}
}