Related papers: Approximating Optimal Bidirectional Macro Schemes
We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE)…
Sequence processing has long been a central area of machine learning research. Recurrent neural nets have been successful in processing sequences for a number of tasks; however, they are known to be both ineffective and computationally…
We present a framework facilitating the implementation and comparison of text compression algorithms. We evaluate its features by a case study on two novel compression algorithms based on the Lempel-Ziv compression schemes that perform well…
One of the most famous and investigated lossless data-compression scheme is the one introduced by Lempel and Ziv about 40 years ago. This compression scheme is known as "dictionary-based compression" and consists of squeezing an input…
Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the…
Relative Lempel-Ziv (RLZ) parsing is a dictionary compression method in which a string $S$ is compressed relative to a second string $R$ (called the reference) by parsing $S$ into a sequence of substrings that occur in $R$. RLZ is…
We define an algorithm that parses multidimensional arrays sequentially into mainly unrepeated but nested multidimensional sub-arrays of increasing size, and show that the resulting sub-block pointer encoder compresses almost every…
Lossless data compression has been widely studied in computer science. One of the most widely used lossless data compressions is Lempel-Zip(LZ) 77 parsing, which achieves a high compression ratio. Bidirectional (a.k.a. macro) parsing is a…
The complexity of computing the Lempel-Ziv factorization and the set of all runs (= maximal repetitions) is studied in the decision tree model of computation over ordered alphabet. It is known that both these problems can be solved by RAM…
Finding approximate occurrences of a pattern in a text using a full-text index is a central problem in bioinformatics and has been extensively researched. Bidirectional indices have opened new possibilities in this regard allowing the…
We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that…
For massive data, the family of subsampling algorithms is popular to downsize the data volume and reduce computational burden. Existing studies focus on approximating the ordinary least squares estimate in linear regression, where…
Relative Lempel-Ziv (RLZ) is a popular algorithm for compressing databases of genomes from individuals of the same species when fast random access is desired. With Kuruppu et al.'s (SPIRE 2010) original implementation, a reference genome is…
LZ-End is a variant of the well-known Lempel-Ziv parsing family such that each phrase of the parsing has a previous occurrence, with the additional constraint that the previous occurrence must end at the end of a previous phrase. LZ-End was…
The Lempel-Ziv parsing of a string (LZ77 for short) is one of the most important and widely-used algorithmic tools in data compression and string processing. We show that the Lempel-Ziv parsing of a string of length $n$ on an alphabet of…
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that…
This paper studies a class of simple bilevel optimization problems where we minimize a composite convex function at the upper-level subject to a composite convex lower-level problem. Existing methods either provide asymptotic guarantees for…
Random sequences attain the highest entropy rate. The estimation of entropy rate for an ergodic source can be done using the Lempel Ziv complexity measure yet, the exact entropy rate value is only reached in the infinite limit. We prove…
Shannon's entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is $b$, the least number of phrases of a general bidirectional parse of…
We present a new, simple, and efficient approach for computing the Lempel-Ziv (LZ77) factorization of a string in linear time, based on suffix arrays. Computational experiments on various data sets show that our approach constantly…