Practical and Effective Re-Pair Compression

Philip Bille; Inge Li Gørtz; Nicola Prezza

Data Structures and Algorithms 2017-04-28 v1

Abstract

Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses $(1+\epsilon)n+\sqrt n$ words on top of the re-writable text (of length $n$ and stored in $n$ words), for any constant $\epsilon>0$; however, this solution relies on complex sub-procedures that prevent it from being practical. In this paper, we present an implementation of the above-mentioned result that uses more practical solutions; our tool further improves the working space to $(1.5+\epsilon)n$ words (text included), for some small constant $\epsilon$. As a second contribution, we focus on compact representations of the output grammar. The lower bound for storing a grammar with $d$ rules is $\log(d!)+2d\approx d\log d+0.557 d$ bits, and the most efficient encoding algorithm in the literature uses at most $d\log d + 2d$ bits and runs in $\mathcal O(d^{1.5})$ time. We describe a linear-time heuristic maximizing the compressibility of the output Re-Pair grammar. On real datasets, our grammar encoding uses, on average, only $2.8\%$ more bits than the information-theoretic minimum. In half of the tested cases, our compressor produces smaller output than 7-Zip with its maximum compression level enabled.
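
To make the pair-replacement idea concrete, the following is a minimal sketch of textbook Re-Pair in Python. It is not the space-efficient linear-time algorithm or the grammar encoding described in the abstract: it recounts all pairs in every round, so it runs in quadratic time and is meant only to illustrate how the grammar is built. The function names `count_pairs`, `repair`, and `expand`, and the choice of integers as nonterminal symbols, are assumptions made for this illustration.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair in seq."""
    counts = Counter()
    last_start = {}  # pair -> start index of its last counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # A pair of equal symbols (x, x) overlaps itself in runs such as "xxx";
        # skip an occurrence that overlaps the previously counted one.
        if pair[0] == pair[1] and last_start.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_start[pair] = i
    return counts


def repair(text):
    """Naive Re-Pair (illustrative only): returns (final sequence, rules).

    Terminals are the characters of `text`; nonterminals are the integers
    0, 1, 2, ... and rules[k] is the pair of symbols that k expands to.
    """
    seq = list(text)
    rules = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = max(counts.items(), key=lambda item: item[1])
        if freq < 2:
            break  # no pair repeats, so a new rule would not save space
        symbol = len(rules)
        rules[symbol] = pair
        # Replace occurrences of the chosen pair greedily, left to right.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Decompress: recursively expand nonterminals back into characters."""
    pieces = []
    for sym in seq:
        if isinstance(sym, int):
            pieces.append(expand(rules[sym], rules))
        else:
            pieces.append(sym)
    return "".join(pieces)


if __name__ == "__main__":
    seq, rules = repair("abababab")
    print(seq, rules)
    print(expand(seq, rules))
```

On the input `abababab`, the sketch first replaces `ab` by a new symbol and then replaces the resulting repeated pair of that symbol, leaving a two-symbol sequence and two rules; `expand` recovers the original text.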

Keywords

string algorithms, source coding, succinct data structure

Cite

@article{arxiv.1704.08558,
  title  = {Practical and Effective Re-Pair Compression},
  author = {Philip Bille and Inge Li Gørtz and Nicola Prezza},
  journal= {arXiv preprint arXiv:1704.08558},
  year   = {2017}
}
