Related papers: Genome Compression Against a Reference
DNA sequencing technology has advanced to a point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while…
The fall of prices of the high-throughput genome sequencing changes the landscape of modern genomics. A number of large scale projects aimed at sequencing many human genomes are in progress. Genome sequencing also becomes an important aid…
An approach for approximately calculating the number of genes in a genome is presented, which takes into account the average protein length expected for the species. A number of virus, bacterial and eukaryotic genomes are scrutinized.…
Biological data mainly comprises of Deoxyribonucleic acid (DNA) and protein sequences. These are the biomolecules which are present in all cells of human beings. Due to the self-replicating property of DNA, it is a key constitute of genetic…
We live in a period where bio-informatics is rapidly expanding, a significant quantity of genomic data has been produced as a result of the advancement of high-throughput genome sequencing technology, raising concerns about the costs…
Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals' genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv…
Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be…
Technology progress in DNA sequencing boosts the genomic database growth at faster and faster rate. Compression, accompanied with random access capabilities, is the key to maintain those huge amounts of data. In this paper we present an…
We present a Compression Tool, "GenBit Compress", for genetic sequences based on our new proposed "GenBit Compress Algorithm". Our Tool achieves the best compression ratios for Entire Genome (DNA sequences) . Significantly better…
Long-context DNA models are limited by token-mixing cost and by how compression allocates representational budget across the genome. Existing approaches operate close to base-pair resolution, apply fixed downsampling, or learn…
While achieving a compression ratio of 2.0 bits/base, the new algorithm codes non-N bases in fixed length. It dramatically reduces the time of coding and decoding than previous DNA compression algorithms and some universal compression…
Modern biological science produces vast amounts of genomic sequence data. This is fuelling the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques coming from information…
Motivation: The availability of thousands of invidual genomes of one species should boost rapid progress in personalized medicine or understanding of the interaction between genotype and phenotype, to name a few applications. A key…
Motivation The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being…
We define the complexity of DNA sequences as the information content per nucleotide, calculated by means of some Lempel-Ziv data compression algorithm. It is possible to use the statistics of the complexity values of the functional regions…
DNA is a leading candidate as the next archival storage media due to its density, durability and sustainability. To read (and write) data DNA storage exploits technology that has been developed over decades to sequence naturally occurring…
Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform…
Storing and archiving data produced by next-generation sequencing (NGS) is a huge burden for research institutions. Reference-based compression algorithms are effective in dealing with these data. Our work focuses on compressing FASTQ…
Over the past several years, DNA sequencing has emerged as one of the driving forces in life-sciences, paving the way for affordable and accurate whole genome sequencing. As genomes represent the entirety of an organism's hereditary…
Single-nucleotide polymorphisms (SNPs) account for most variations between human genomes. We show how, if the genomes in a database differ only by a reasonable number of SNPs and the substrings between those SNPs are unique, then we can…