Document Counting in Practice

Data Structures and Algorithms 2015-10-02 v2

Abstract

We address the problem of counting the strings in a collection in which a given pattern appears, a task with applications in information retrieval and data mining. Existing solutions have so far remained theoretical. We implement these solutions and develop some new variants, comparing them experimentally on various datasets. Our results not only show which options are best in each situation and help discard solutions that are unappealing in practice, but also uncover some unexpected compressibility properties of the best data structures. By taking advantage of these properties, we can reduce the size of the structures by a factor of 5 to 400, depending on the dataset.
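To make the problem statement concrete, here is a minimal brute-force sketch of document counting. It is only a naive baseline for illustration, not any of the data structures evaluated in the paper; the function name `count_documents` and the sample collection are hypothetical.

```python
def count_documents(documents, pattern):
    """Count how many documents contain `pattern` at least once.

    Naive baseline: scans every document on every query. A document is
    counted once, however many times the pattern occurs inside it.
    """
    return sum(1 for doc in documents if pattern in doc)


# Hypothetical sample collection, used only for illustration.
docs = ["abracadabra", "banana", "cabana"]

# "ana" occurs twice in "banana" and once in "cabana",
# but each document is counted only once.
print(count_documents(docs, "ana"))
```

The scan takes time proportional to the total length of the collection for each query, which is the cost that a dedicated counting structure built over a text index aims to avoid.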

Cite

@article{arxiv.1409.6780,
  title  = {Document Counting in Practice},
  author = {Travis Gagie and Aleksi Hartikainen and Juha Kärkkäinen and Gonzalo Navarro and Simon J. Puglisi and Jouni Sirén},
  journal= {arXiv preprint arXiv:1409.6780},
  year   = {2015}
}

Comments

This is a slightly extended version of the paper that was presented at DCC 2015. The implementations are available at http://jltsiren.kapsi.fi/rlcsa and https://github.com/ahartik/succinct
