Related papers: String Attractors: Verification and Optimization
In today's data-centric world, fast and effective compression of data is paramount. To measure success towards the second goal, Kempa and Prezza [STOC2018] introduce the string attractor, a combinatorial object unifying dictionary-based…
Let $S$ be a string of length $n$. In this paper we introduce the notion of \emph{string attractor}: a subset of the string's positions $[1,n]$ such that every distinct substring of $S$ has an occurrence crossing one of the attractor's…
A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary…
A $k$-attractor is a combinatorial object unifying dictionary-based compression. It allows one to compare the repetitiveness measures of different dictionary compressors such as Lempel-Ziv 77, the Burrows-Wheeler transform, straight line…
The notion of \emph{string attractor} has recently been introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word…
The notion of string attractor has been introduced in [Kempa and Prezza, 2018] in the context of Data Compression and it represents a set of positions of a finite word in which all of its factors can be "attracted". The smallest size…
We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings,…
Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of…
The article focuses on word (or string) attractors, which are sets of positions related to the text compression efficiency of the underlying word. The article presents two combinatorial algorithms based on Suffix automata or Directed…
The random access problem for compressed strings is to build a data structure that efficiently supports accessing the character in position $i$ of a string given in compressed form. Given a grammar of size $n$ compressing a string of size…
We study the problem of supporting queries on a string $S$ of length $n$ within a space bounded by the size $\gamma$ of a string attractor for $S$. Recent works showed that random access on $S$ can be supported in optimal…
A string attractor of a string $T[1..|T|]$ is a set of positions $\Gamma$ of $T$ such that any substring $w$ of $T$ has an occurrence that crosses a position in $\Gamma$, i.e., there is a position $i$ such that $w = T[i..i+|w|-1]$ and the…
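The definition above can be checked directly by brute force. The following is a minimal sketch, not an algorithm taken from any of the listed papers: it assumes 1-based positions as in the abstract, and the function name `is_string_attractor` is a hypothetical helper introduced only for illustration. It enumerates every occurrence of every substring, so it is only practical for short strings.

```python
def is_string_attractor(text, positions):
    """Brute-force check that `positions` (1-based) is a string attractor of `text`.

    Illustrative helper (hypothetical name), intended for short strings only.
    """
    n = len(text)
    gamma = set(positions)
    # For each distinct substring, record whether at least one of its
    # occurrences contains a position from gamma.
    covered = {}
    for i in range(n):
        for j in range(i, n):
            w = text[i:j + 1]
            # The occurrence text[i..j] spans 1-based positions i+1 .. j+1.
            crosses = any(i + 1 <= p <= j + 1 for p in gamma)
            covered[w] = covered.get(w, False) or crosses
    return all(covered.values())
```

For example, `is_string_attractor("abab", [2, 3])` holds because every substring of `abab` has an occurrence containing position 2 or 3, whereas `[1]` fails because no occurrence of `b` contains position 1.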
First studied by Kempa and Prezza in 2018 as the cement of text compression algorithms, string attractors have become a compelling object of theoretical research within the community of combinatorics on words. In this context, they have…
Suppose an oracle knows a string $S$ that is unknown to us and that we want to determine. The oracle can answer queries of the form "Is $s$ a substring of $S$?". In 1995, Skiena and Sundaram showed that, in the worst case, any algorithm…
The classic exact pattern matching problem, given two strings -- a pattern $P$ of length $m$ and a text $T$ of length $n$ -- asks whether $P$ occurs as a substring of $T$. A property tester for the problem needs to distinguish (with high…
Given strings $P$ and $Q$ the (exact) string matching problem is to find all positions of substrings in $Q$ matching $P$. The classical Knuth-Morris-Pratt algorithm [SIAM J. Comput., 1977] solves the string matching problem in linear time…
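As a reminder of the classical baseline mentioned in this abstract, here is a minimal sketch of the textbook Knuth-Morris-Pratt algorithm. It is not the contribution of the listed paper; the function names `failure_function` and `kmp_search` are chosen here for illustration, and positions are reported 0-based.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        # Fall back through shorter borders until the next character extends one.
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern, text):
    """Return all 0-based start positions where `pattern` occurs in `text`."""
    if not pattern:
        # Convention chosen for this sketch: the empty pattern matches everywhere.
        return list(range(len(text) + 1))
    fail = failure_function(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going so overlapping matches are found
    return matches
```

With $P$ = `aba` and $Q$ = `ababa`, `kmp_search("aba", "ababa")` reports the overlapping occurrences at positions 0 and 2. The text pointer never moves backwards, which is where the linear running time comes from.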
Approximate pattern matching is a natural and well-studied problem on strings: Given a text $T$, a pattern $P$, and a threshold $k$, find (the starting positions of) all substrings of $T$ that are at distance at most $k$ from $P$. We…
Let $\mathcal{D}$ be a collection of $D$ documents, which are strings over an alphabet of size $\sigma$, of total length $n$. We describe a data structure that uses linear space and reports the $k$ most relevant documents that contain a…
We study the fundamental problem of finding the best string to represent a given set, in the form of the Closest String problem: Given a set $X \subseteq \Sigma^d$ of $n$ strings, find the string $x^*$ minimizing the radius of the smallest…
Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the…