PFP Data Structures

Data Structures and Algorithms 2020-06-23 v1

Abstract

Prefix-free parsing (PFP) was introduced by Boucher et al. (2019) as a preprocessing step to ease the computation of Burrows-Wheeler Transforms (BWTs) of genomic databases. Given a string $S$, it produces a dictionary $D$ and a parse $P$ of overlapping phrases such that $\mathrm{BWT}(S)$ can be computed from $D$ and $P$ in time and workspace bounded in terms of their combined size $|\mathrm{PFP}(S)|$. In practice $D$ and $P$ are significantly smaller than $S$ and computing $\mathrm{BWT}(S)$ from them is more efficient than computing it from $S$ directly, at least when $S$ consists of genomes from individuals of the same species. In this paper, we consider $\mathrm{PFP}(S)$ as a data structure and show how it can be augmented to support the following queries quickly, still in $O(|\mathrm{PFP}(S)|)$ space: longest common extension (LCE), suffix array (SA), longest common prefix (LCP) and BWT. Lastly, we provide experimental evidence that the PFP data structure can be efficiently constructed for very large repetitive datasets: it takes one hour and 54 GB peak memory for $1000$ variants of human chromosome 19, initially occupying roughly 56 GB.
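
To make the dictionary and parse concrete, here is a minimal Python sketch of prefix-free parsing on a toy string. It is an illustration under stated assumptions, not the authors' implementation: the window length `w`, the modulus `p`, the sentinel characters `#` and `$`, and the use of CRC32 as the trigger test are all choices made for this example (a rolling hash would normally be used so that each window is hashed in constant time). The names `prefix_free_parse` and `reconstruct` are hypothetical.

```python
import zlib


def prefix_free_parse(s, w=2, p=3):
    """Toy prefix-free parse of s.

    Returns (dictionary, parse): the sorted list of distinct phrases and
    the sequence of phrase ranks in that list.
    Assumption: the sentinels '#' and '$' do not occur in s.
    """
    text = "#" + s + "$" * w  # sentinel-padded input

    def is_trigger(window):
        # Assumed trigger rule for this sketch: CRC32 divisible by p.
        return zlib.crc32(window.encode()) % p == 0

    phrases = []
    start = 0
    last = len(text) - w  # start position of the final '$' * w window
    for i in range(1, last + 1):
        # Cut a phrase at every trigger window and at the final window.
        if i == last or is_trigger(text[i:i + w]):
            # The phrase includes the window, so consecutive phrases
            # overlap by exactly w characters.
            phrases.append(text[start:i + w])
            start = i

    dictionary = sorted(set(phrases))
    rank = {phrase: r for r, phrase in enumerate(dictionary)}
    parse = [rank[phrase] for phrase in phrases]
    return dictionary, parse


def reconstruct(dictionary, parse, w=2):
    """Rebuild the original string from the dictionary and the parse."""
    phrases = [dictionary[r] for r in parse]
    # Drop the w-character overlap at the start of every phrase but the first.
    text = phrases[0] + "".join(phrase[w:] for phrase in phrases[1:])
    return text[1:len(text) - w]  # strip '#' and the trailing '$' * w


if __name__ == "__main__":
    s = "GATTACAGATTACAGATTAGA"
    dictionary, parse = prefix_free_parse(s)
    print(dictionary)
    print(parse)
    print(reconstruct(dictionary, parse) == s)
```

On repetitive input, repeated substrings tend to be cut at the same trigger windows, so the same phrases recur and the dictionary stays small relative to the input, which is the effect the abstract describes for collections of genomes from one species.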

Cite

@article{arxiv.2006.11687,
  title  = {PFP Data Structures},
  author = {Christina Boucher and Ondřej Cvacho and Travis Gagie and Jan Holub and Giovanni Manzini and Gonzalo Navarro and Massimiliano Rossi},
  journal= {arXiv preprint arXiv:2006.11687},
  year   = {2020}
}