
Variable-Order de Bruijn Graphs

Data Structures and Algorithms 2014-11-18 v2 Genomics Quantitative Methods

Abstract

The de Bruijn graph $G_K$ of a set of strings $S$ is a key data structure in genome assembly that represents overlaps between all the $K$-length substrings of $S$. Construction and navigation of the graph is a space and time bottleneck in practice and the main hurdle for assembling large eukaryotic genomes. This problem is compounded by the fact that state-of-the-art assemblers do not build the de Bruijn graph for a single order (value of $K$) but for multiple values of $K$. More precisely, they build $d$ de Bruijn graphs, each with a specific order, i.e., $G_{K_1}, G_{K_2}, \ldots, G_{K_d}$. Although this paradigm increases the quality of the assembly produced, it increases the memory by a factor of $d$ in most cases. In this paper, we show how to augment a succinct de Bruijn graph representation by Bowe et al. (Proc. WABI, 2012) to support new operations that let us change order on the fly, effectively representing all de Bruijn graphs of order up to some maximum $K$ in a single data structure. Our experiments show our variable-order de Bruijn graph only modestly increases space usage, construction time, and navigation time compared to a single order graph.
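To make the multi-order paradigm concrete, the following is a minimal, naive sketch in Python of building one de Bruijn graph per order. It is not the succinct representation of Bowe et al. nor the variable-order structure proposed here; it only illustrates why keeping $d$ separate graphs multiplies memory. The function names (`de_bruijn_graph`, `multi_order_graphs`) and the example reads are hypothetical, and the sketch assumes one common convention in which each distinct $K$-length substring is an edge from its $(K-1)$-length prefix to its $(K-1)$-length suffix.

```python
from collections import defaultdict


def de_bruijn_graph(strings, k):
    """Naive de Bruijn graph of order k (assumed convention: each k-mer
    is an edge from its (k-1)-length prefix to its (k-1)-length suffix).

    Returns a dict mapping each prefix node to the set of its successors.
    """
    graph = defaultdict(set)
    for s in strings:
        # Slide a window of length k over the string.
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def multi_order_graphs(strings, orders):
    """Build one independent graph per order, as a multi-K assembler would.

    Each graph is stored separately, so memory grows with len(orders).
    """
    return {k: de_bruijn_graph(strings, k) for k in orders}


# Hypothetical toy reads, for illustration only.
reads = ["ACGTAC", "CGTACG"]
graphs = multi_order_graphs(reads, [3, 4])
for k, g in graphs.items():
    print(k, sorted(g.items()))
```

Each entry of `graphs` is a full, independent adjacency map; a variable-order structure aims to avoid this duplication by answering queries for every order up to the maximum from a single representation.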

Cite

@article{arxiv.1411.2718,
  title  = {Variable-Order de Bruijn Graphs},
  author = {Christina Boucher and Alex Bowe and Travis Gagie and Simon J. Puglisi and Kunihiko Sadakane},
  journal= {arXiv preprint arXiv:1411.2718},
  year   = {2014}
}

Comments

Conference submission, 10 pages, +minor corrections
