Related papers: Substring Complexity in Sublinear Space

Towards a Definitive Compressibility Measure for Repetitive Sequences

Unlike in statistical compression, where Shannon's entropy is a definitive lower bound, no such clear measure exists for the compressibility of repetitive sequences. Since statistical entropy does not capture repetitiveness, ad-hoc measures…

Data Structures and Algorithms · Computer Science 2021-01-18 Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

Substring Complexities on Run-length Compressed Strings

Let $S_{T}(k)$ denote the set of distinct substrings of length $k$ in a string $T$, then the $k$-th substring complexity is defined by its cardinality $|S_{T}(k)|$. Recently, $\delta = \max \{ |S_{T}(k)| / k : k \ge 1 \}$ is shown to be a…

Data Structures and Algorithms · Computer Science 2022-05-26 Akiyoshi Kawamoto, Tomohiro I

Generalization of Repetitiveness Measures for Two-Dimensional Strings

The problem of detecting and measuring the repetitiveness of one-dimensional strings has been extensively studied in data compression and text indexing. Our understanding of these issues has been significantly improved by the introduction…

Data Structures and Algorithms · Computer Science 2025-05-19 Lorenzo Carfagna, Giovanni Manzini, Giuseppe Romana, Marinella Sciortino, Cristian Urbina

Sensitivity of string compressors and repetitiveness measures

The sensitivity of a string compression algorithm $C$ asks how much the output size $C(T)$ for an input string $T$ can increase when a single character edit operation is performed on $T$. This notion enables one to measure the robustness of…

Data Structures and Algorithms · Computer Science 2023-02-10 Tooru Akagi, Mitsuru Funakoshi, Shunsuke Inenaga

String Attractors

Let $S$ be a string of length $n$. In this paper we introduce the notion of \emph{string attractor}: a subset of the string's positions $[1,n]$ such that every distinct substring of $S$ has an occurrence crossing one of the attractor's…

Data Structures and Algorithms · Computer Science 2017-09-20 Nicola Prezza

On repetitiveness measures of Thue-Morse words

We show that the size $\gamma(t_n)$ of the smallest string attractor of the $n$th Thue-Morse word $t_n$ is 4 for any $n\geq 4$, disproving the conjecture by Mantaci et al. [ICTCS 2019] that it is $n$. We also show that $\delta(t_n) =…

Data Structures and Algorithms · Computer Science 2020-08-13 Kanaru Kutsukake, Takuya Matsumoto, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda

Computing Matching Statistics on Repetitive Texts

Computing the \emph{matching statistics} of a string $P[1..m]$ with respect to a text $T[1..n]$ is a fundamental problem which has application to genome sequence comparison. In this paper, we study the problem of computing the matching…

Data Structures and Algorithms · Computer Science 2022-01-14 Younan Gao

Exploring Repetitiveness Measures for Two-Dimensional Strings

Detecting and measuring repetitiveness of strings is a problem that has been extensively studied in data compression and text indexing. However, when the data are structured in a non-linear way, like in the context of two-dimensional…

Data Structures and Algorithms · Computer Science 2024-04-11 Giuseppe Romana, Marinella Sciortino, Cristian Urbina

String Attractors and Infinite Words

The notion of string attractor has been introduced in [Kempa and Prezza, 2018] in the context of Data Compression and it represents a set of positions of a finite word in which all of its factors can be "attracted". The smallest size…

Formal Languages and Automata Theory · Computer Science 2022-06-02 Antonio Restivo, Giuseppe Romana, Marinella Sciortino

String Attractors: Verification and Optimization

String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $\Gamma\subseteq [1..n]$ is a $k$-attractor for a string $S\in[1..\sigma]^n$ if and…

Data Structures and Algorithms · Computer Science 2020-12-09 Dominik Kempa, Alberto Policriti, Nicola Prezza, Eva Rotenberg

At the Roots of Dictionary Compression: String Attractors

A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary…

Data Structures and Algorithms · Computer Science 2020-12-17 Dominik Kempa, Nicola Prezza

Optimal-Time Dictionary-Compressed Indexes

We describe the first self-indexes able to count and locate pattern occurrences in optimal time within a space bounded by the size of the most popular dictionary compressors. To achieve this result we combine several recent findings,…

Data Structures and Algorithms · Computer Science 2019-09-06 Anders Roy Christiansen, Mikko Berggren Ettienne, Tomasz Kociumaka, Gonzalo Navarro, Nicola Prezza

Tight Lower Bounds for Central String Queries in Compressed Space

In this work, we study the limits of compressed data structures, i.e., structures that support various queries on an input text $T\in\Sigma^n$ using space proportional to the size of $T$ in compressed form. Nearly all fundamental queries…

Data Structures and Algorithms · Computer Science 2025-10-23 Dominik Kempa, Tomasz Kociumaka

Compressed Index with Construction in Compressed Space

Suppose that we are given a string $s$ of length $n$ over an alphabet $\{0,1,\ldots,n^{O(1)}\}$ and $\delta$ is the string complexity of $s$, a known compression measure. We describe an index on $s$ with $O(\delta\log\frac{n}{\delta})$…

Data Structures and Algorithms · Computer Science 2026-04-15 Dmitry Kosolobov

On Stricter Reachable Repetitiveness Measures

The size $b$ of the smallest bidirectional macro scheme, which is arguably the most general copy-paste scheme to generate a given sequence, is considered to be the strictest reachable measure of repetitiveness. It is strictly lower-bounded…

Data Structures and Algorithms · Computer Science 2021-05-31 Gonzalo Navarro, Cristian Urbina

Online computation of normalized substring complexity

The normalized substring complexity $\delta$ of a string is defined as $\max_k \{c[k]/k\}$, where $c[k]$ is the number of \textit{distinct} substrings of length $k$. This simply defined measure has recently attracted attention due to its…

Data Structures and Algorithms · Computer Science 2026-02-17 Gregory Kucherov, Yakov Nekrich

Online String Attractors

In today's data-centric world, fast and effective compression of data is paramount. To measure success towards the second goal, Kempa and Prezza [STOC 2018] introduce the string attractor, a combinatorial object unifying dictionary-based…

Data Structures and Algorithms · Computer Science 2024-07-23 Philip Whittington

Sketching and Streaming for Dictionary Compression

We initiate the study of sub-linear sketching and streaming techniques for estimating the output size of common dictionary compressors such as Lempel-Ziv '77, the run-length Burrows-Wheeler transform, and grammar compression. To this end,…

Data Structures and Algorithms · Computer Science 2024-08-20 Ruben Becker, Matteo Canton, Davide Cenzato, Sung-Hwan Kim, Bojana Kodric, Nicola Prezza

The landscape of compressibility measures for two-dimensional data

In this paper we extend to two-dimensional data two recently introduced one-dimensional compressibility measures: the $\gamma$ measure defined in terms of the smallest string attractor, and the $\delta$ measure defined in terms of the…

Data Structures and Algorithms · Computer Science 2024-05-21 Lorenzo Carfagna, Giovanni Manzini

Sublinear Algorithms for Approximating String Compressibility

We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE)…

Data Structures and Algorithms · Computer Science 2007-06-11 Sofya Raskhodnikova, Dana Ron, Ronitt Rubinfeld, Adam Smith