Capacity-Approaching Constrained Codes with Error Correction for DNA-Based Data Storage

Information Theory 2020-01-10 v1 math.IT

Abstract

We propose coding techniques that limit the length of homopolymer runs, enforce the GC-content constraint, and can correct a single edit error in strands of nucleotides in DNA-based data storage systems. In particular, for given $\ell, \epsilon > 0$, we propose simple and efficient encoders/decoders that transform binary sequences into DNA base sequences (codewords), namely sequences of the symbols A, T, C and G, that satisfy the following properties: (i) Runlength constraint: the maximum homopolymer run in each codeword is at most $\ell$, (ii) GC-content constraint: the GC-content of each codeword is within $[0.5-\epsilon, 0.5+\epsilon]$, (iii) Error-correction: each codeword is capable of correcting a single deletion, a single insertion, or a single substitution error. For practical values of $\ell$ and $\epsilon$, we show that our encoders achieve much higher rates than existing results in the literature and approach the capacity. Our methods have low encoding/decoding complexity and limited error propagation.
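Constraints (i) and (ii) above can be made concrete with a small checker. The following is a minimal sketch of a validator only, not the encoders or decoders proposed in the paper; the function names (`max_homopolymer_run`, `gc_content`, `satisfies_constraints`) and parameter names `ell` and `eps` are hypothetical and chosen here for illustration. Property (iii), single-edit error correction, is not covered.

```python
from itertools import groupby


def max_homopolymer_run(strand: str) -> int:
    """Length of the longest run of identical consecutive nucleotides."""
    return max((sum(1 for _ in group) for _, group in groupby(strand)), default=0)


def gc_content(strand: str) -> float:
    """Fraction of symbols in the strand that are G or C."""
    if not strand:
        raise ValueError("strand must be non-empty")
    return sum(base in "GC" for base in strand) / len(strand)


def satisfies_constraints(strand: str, ell: int, eps: float) -> bool:
    """Check the runlength constraint (i) and the GC-content constraint (ii).

    Hypothetical helper: ell is the maximum allowed homopolymer run and
    eps the allowed deviation of the GC-content from 0.5.
    """
    run_ok = max_homopolymer_run(strand) <= ell
    gc_ok = abs(gc_content(strand) - 0.5) <= eps
    return run_ok and gc_ok
```

For instance, under these assumptions `satisfies_constraints("AACCGGTT", 2, 0.0)` holds, since the longest run has length 2 and exactly half of the symbols are G or C, whereas `"AAAACGT"` fails for $\ell = 3$ because it contains a run of four A symbols.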

Cite

@article{arxiv.2001.02839,
  title  = {Capacity-Approaching Constrained Codes with Error Correction for DNA-Based Data Storage},
  author = {Tuan Thanh Nguyen and Kui Cai and Kees A. Schouhamer Immink and Han Mao Kiah},
  journal= {arXiv preprint arXiv:2001.02839},
  year   = {2020}
}