Sparse Regular Expression Matching

Data Structures and Algorithms 2023-11-07 v7

Abstract

A regular expression specifies a set of strings formed by single characters combined with concatenation, union, and Kleene star operators. Given a regular expression $R$ and a string $Q$, the regular expression matching problem is to decide if $Q$ matches any of the strings specified by $R$. Regular expressions are a fundamental concept in formal languages and regular expression matching is a basic primitive for searching and processing data. A standard textbook solution [Thompson, CACM 1968] constructs and simulates a nondeterministic finite automaton, leading to an $O(nm)$ time algorithm, where $n$ is the length of $Q$ and $m$ is the length of $R$. Despite considerable research efforts only polylogarithmic improvements of this bound are known. Recently, conditional lower bounds provided evidence for this lack of progress when Backurs and Indyk [FOCS 2016] proved that, assuming the strong exponential time hypothesis (SETH), regular expression matching cannot be solved in $O((nm)^{1-\epsilon})$, for any constant $\epsilon > 0$. Hence, the complexity of regular expression matching is essentially settled in terms of $n$ and $m$. In this paper, we take a new approach and introduce a density parameter, $\Delta$, that captures the amount of nondeterminism in the NFA simulation on $Q$. The density is at most $nm+1$ but can be significantly smaller. Our main result is a new algorithm that solves regular expression matching in $O\left(\Delta \log \log \frac{nm}{\Delta} + n + m\right)$ time. This essentially replaces $nm$ with $\Delta$ in the complexity of regular expression matching. We complement our upper bound by a matching conditional lower bound that proves that we cannot solve regular expression matching in time $O(\Delta^{1-\epsilon})$ for any constant $\epsilon > 0$ assuming SETH.
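To make the textbook baseline concrete, the following is a minimal Python sketch of state-set NFA simulation, the technique behind the $O(nm)$ bound. It is an illustration only, not the algorithm of this paper. The NFA for `(a|b)*c` is hand-built and hypothetical, as are the names `epsilon_closure`, `simulate`, and `EXAMPLE_NFA`. The counter `total_active` sums the sizes of the active state sets over the run; it is offered as an informal proxy for "the amount of nondeterminism" and is an assumption, since the abstract does not give the formal definition of $\Delta$.

```python
def epsilon_closure(states, eps):
    """Return every state reachable from `states` using only epsilon edges."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def simulate(nfa, text):
    """Run the state-set simulation of `nfa` on `text`.

    Returns (accepted, total_active), where total_active is the summed size
    of the active state set before the first character and after each
    character. This count is an illustrative proxy only (assumption), not
    the paper's formal density.
    """
    start, accept, char_edges, eps = nfa
    current = epsilon_closure({start}, eps)
    total_active = len(current)
    for ch in text:
        # Follow all character edges labelled `ch` out of the active set.
        moved = {t for s in current for (c, t) in char_edges.get(s, ()) if c == ch}
        current = epsilon_closure(moved, eps)
        total_active += len(current)
    return accept in current, total_active


# Hypothetical hand-built NFA for (a|b)*c: state 0 is the start, 4 accepts.
EXAMPLE_NFA = (
    0,                                         # start state
    4,                                         # accepting state
    {1: [("a", 2), ("b", 2)], 3: [("c", 4)]},  # character edges
    {0: [1, 3], 2: [1, 3]},                    # epsilon edges
)

print(simulate(EXAMPLE_NFA, "abc"))   # (True, 10)
print(simulate(EXAMPLE_NFA, "abca"))  # (False, 10)
```

On `abca` the active set becomes empty after the final character, so the last step adds nothing to the count. Inputs on which few states stay active are the kind of sparse case that a density-sensitive bound is meant to exploit, whereas the plain simulation above still pays for scanning every character edge of every active state.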

Cite

@article{arxiv.1907.04752,
  title  = {Sparse Regular Expression Matching},
  author = {Philip Bille and Inge Li Gørtz},
  journal= {arXiv preprint arXiv:1907.04752},
  year   = {2023}
}