Related papers: Succinct Data Structures for Assembling Large Geno…
Massively parallel DNA sequencing technologies are revolutionizing genomics research. Billions of short reads generated at low costs can be assembled for reconstructing the whole genomes. Unfortunately, the large memory footprint of the…
de Bruijn graph-based algorithms are one of the two most widely used approaches for de novo genome assembly. A major limitation of this approach is the large computational memory space requirement to construct the de Bruijn graph, which…
Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for {\em de novo} assembly of short-read shotgun sequencing data from these complex populations are an…
The de Bruijn graph plays an important role in bioinformatics, especially in the context of de novo assembly. However, the representation of the de Bruijn graph in memory is a computational bottleneck for many assemblers. Recent papers…
The formal version of our work has been published in BMC Bioinformatics and can be found here: http://www.biomedcentral.com/1471-2105/13/S6/S1 Motivation: To tackle the problem of huge memory usage associated with de Bruijn graph-based…
De novo genome assembly focuses on finding connections between a vast amount of short sequences in order to reconstruct the original genome. The central problem of genome assembly could be described as finding a Hamiltonian path through a…
The merging of succinct data structures is a well established technique for the space efficient construction of large succinct indexes. In the first part of the paper we propose a new algorithm for merging succinct representations of de…
De Bruijn graph is one of the most important data structures used in de-novo genome assembly algorithms, especially for NGS data. There is a growing need for parallel data structures and algorithms due to the increasing number of cores in…
Converting a set of sequencing reads into a lossless compact data structure that encodes all the relevant biological information is a major challenge. The classical approaches are to build the string graph or the de Bruijn graph. Each has…
Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to…
The first step in any genome assembly algorithm entails the conversion from the domain of strings and overlaps to the language of graphs and paths, typically using one of the two conventional methods: de Bruijn graphs or overlap graphs.…
We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based…
(An updated version of this manuscript has been accepted to Scientific Reports in 2016, please refer to http://www.nature.com/articles/srep31900) The highly anticipated transition from next generation sequencing (NGS) to third generation…
Motivation: Data volumes generated by next-generation sequencing technolo- gies is now a major concern, both for storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as…
One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies…
Motivation: De Bruijn graphs have been proposed as a data structure to facilitate the analysis of related whole genome sequences, in both a population and comparative genomic settings. However, current approaches do not scale well to many…
Genome assembly is a prominent problem studied in bioinformatics, which computes the source string using a set of its overlapping substrings. Classically, genome assembly uses assembly graphs built using this set of substrings to compute…
De novo DNA assembly is a fundamental task in Bioinformatics, and finding Eulerian paths on de Bruijn graphs is one of the dominant approaches to it. In most of the cases, there may be no one order for the de Bruijn graph that works well…
De novo genome assembly is the process of stitching short DNA sequences to generate longer DNA sequences, without using any reference sequence for alignment. It enables high-throughput genome sequencing and thus accelerates the discovery of…
Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly…