Reference Based Genome Compression

Information Theory 2016-11-15 v1 math.IT

Abstract

DNA sequencing technology has advanced to the point where storage is becoming the central bottleneck in acquiring and mining more data. Large amounts of data are vital for genomics research, and generic compression tools, while viable, cannot offer the same savings as approaches tuned to the inherent biological properties of the data. We propose an algorithm that compresses a target genome given a known reference genome. The algorithm first generates a mapping from the reference to the target genome and then compresses this mapping with an entropy coder. To illustrate its performance: applied to James Watson's genome with hg18 as the reference, our algorithm reduces the 2991-megabyte (MB) genome to 6.99 MB, whereas Gzip compresses it to 834.8 MB.
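
The two-stage idea described in the abstract (map the target onto the reference, then compress the mapping) can be illustrated with a minimal sketch. This is not the paper's algorithm: the greedy k-mer matching, the seed length `K`, the operation format, and all function names are assumptions made for illustration, and `zlib` stands in for the entropy coder.

```python
import json
import zlib

K = 4  # hypothetical seed length, chosen only for this toy example


def build_index(reference, k=K):
    """Map every k-mer of the reference to the positions where it starts."""
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)
    return index


def map_target(reference, target, k=K):
    """Greedily describe the target as matches against the reference.

    Emits ["M", ref_pos, length] for a copied run of the reference and
    ["L", base] for a literal base that could not be matched.
    """
    index = build_index(reference, k)
    ops = []
    i = 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer.
        for pos in index.get(target[i:i + k], []):
            length = 0
            while (pos + length < len(reference)
                   and i + length < len(target)
                   and reference[pos + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            ops.append(["M", best_pos, best_len])
            i += best_len
        else:
            ops.append(["L", target[i]])
            i += 1
    return ops


def reconstruct(reference, ops):
    """Rebuild the target from the reference and the mapping."""
    parts = []
    for op in ops:
        if op[0] == "M":
            parts.append(reference[op[1]:op[1] + op[2]])
        else:
            parts.append(op[1])
    return "".join(parts)


def compress_mapping(ops):
    """Serialize the mapping and compress it (zlib as a stand-in coder)."""
    return zlib.compress(json.dumps(ops).encode("ascii"), 9)


def decompress_mapping(blob):
    """Inverse of compress_mapping."""
    return json.loads(zlib.decompress(blob).decode("ascii"))


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGGCTA"
    # Target differs from the reference by one substitution and one insertion.
    target = "ACGTACGTTAGACGATAGGGCTA"
    blob = compress_mapping(map_target(reference, target))
    assert reconstruct(reference, decompress_mapping(blob)) == target
```

Because only the decoder's copy of the reference and the compressed mapping are needed to rebuild the target, the stored size depends on how much the target differs from the reference rather than on the length of the genome. On inputs as short as this toy example the compressed mapping will not be smaller than the raw string; the sketch only demonstrates the round trip.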

Cite

@article{arxiv.1204.1912,
  title  = {Reference Based Genome Compression},
  author = {Bobbie Chern and Idoia Ochoa and Alexandros Manolakos and Albert No and Kartik Venkat and Tsachy Weissman},
  journal= {arXiv preprint arXiv:1204.1912},
  year   = {2016}
}

Comments

5 pages; Submitted to the IEEE Information Theory Workshop (ITW) 2012