Practical and Effective Re-Pair Compression

Data Structures and Algorithms 2017-04-28 v1

Abstract

Re-Pair is an efficient grammar compressor that operates by recursively replacing high-frequency character pairs with new grammar symbols. The most space-efficient linear-time algorithm computing Re-Pair uses $(1+\epsilon)n+\sqrt n$ words on top of the re-writable text (of length $n$ and stored in $n$ words), for any constant $\epsilon>0$; however, this solution relies on complex sub-procedures that make it impractical. In this paper, we present an implementation of the above-mentioned result that makes use of more practical solutions; our tool further improves the working space to $(1.5+\epsilon)n$ words (text included), for some small constant $\epsilon$. As a second contribution, we focus on compact representations of the output grammar. The lower bound for storing a grammar with $d$ rules is $\log(d!)+2d\approx d\log d+0.557d$ bits, and the most efficient encoding algorithm in the literature uses at most $d\log d+2d$ bits and runs in $\mathcal O(d^{1.5})$ time. We describe a linear-time heuristic maximizing the compressibility of the output Re-Pair grammar. On real datasets, our grammar encoding uses, on average, only $2.8\%$ more bits than the information-theoretic minimum. In half of the tested cases, our compressor produces smaller output than 7-Zip with its maximum compression rate turned on.
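To make the pair-replacement idea and the grammar lower bound concrete, here is a minimal Python sketch. It is a naive, quadratic-time illustration of the general Re-Pair scheme, not the linear-time, space-efficient algorithm presented in the paper; the function names `repair`, `expand`, and `grammar_lower_bound_bits` are hypothetical and chosen only for this example.

```python
import math
from collections import Counter


def repair(text):
    """Naive Re-Pair sketch: repeatedly replace the most frequent adjacent pair.

    Returns the final sequence and the rules {new_symbol: (left, right)}.
    Assumption: new symbols are tuples ("R", k), so they never collide
    with the characters of the input text.
    """
    seq = list(text)
    rules = {}
    next_id = 0
    while True:
        # Count adjacent pairs. Overlapping occurrences (e.g. in "aaa")
        # are overcounted here, which a careful implementation would avoid.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        sym = ("R", next_id)
        next_id += 1
        rules[sym] = pair
        # Left-to-right scan replacing non-overlapping occurrences.
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Invert repair(): expand grammar symbols back into the original text."""
    out = []
    stack = list(reversed(seq))
    while stack:
        s = stack.pop()
        if s in rules:
            left, right = rules[s]
            stack.append(right)  # pushed first so that left is expanded first
            stack.append(left)
        else:
            out.append(s)
    return "".join(out)


def grammar_lower_bound_bits(d):
    """log2(d!) + 2d, the lower bound quoted in the abstract for d rules."""
    return math.lgamma(d + 1) / math.log(2) + 2 * d
```

By Stirling's approximation, $\log_2(d!) \approx d\log_2 d - d\log_2 e$ with $\log_2 e \approx 1.443$, which is where the $0.557d$ term in the abstract comes from ($2 - 1.443 \approx 0.557$). Calling `expand(*repair(text))` should reproduce `text` for any input string.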

Cite

@article{arxiv.1704.08558,
  title  = {Practical and Effective Re-Pair Compression},
  author = {Philip Bille and Inge Li Gørtz and Nicola Prezza},
  journal= {arXiv preprint arXiv:1704.08558},
  year   = {2017}
}