Coded Shotgun Sequencing

Information Theory 2022-02-09 v2 math.IT Applications

Abstract

Most DNA sequencing technologies are based on the shotgun paradigm: many short reads are obtained from random unknown locations in the DNA sequence. A fundamental question, studied in arXiv:1203.6233, is what read length and coverage depth (i.e., the total number of reads) are needed to guarantee reliable sequence reconstruction. Motivated by DNA-based storage, we study the coded version of this problem; i.e., the scenario where the DNA molecule being sequenced is a codeword from a predefined codebook. Our main result is an exact characterization of the capacity of the resulting shotgun sequencing channel as a function of the read length and coverage depth. In particular, our results imply that, while in the uncoded case, $O(n)$ reads of length greater than $2\log n$ are needed for reliable reconstruction of a length-$n$ binary sequence, in the coded case, only $O(n/\log n)$ reads of length greater than $\log n$ are needed for the capacity to be arbitrarily close to $1$.
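
To give a rough feel for the gap between the two regimes, the scaling laws quoted in the abstract can be evaluated numerically. This is only an illustrative sketch: it assumes base-2 logarithms (natural for a binary sequence, but not stated in the abstract) and sets the constants hidden in the $O(\cdot)$ notation to 1, so the outputs are orders of magnitude, not requirements from the paper. The function name `scaling_comparison` and the dictionary keys are hypothetical.

```python
import math


def scaling_comparison(n: int) -> dict:
    """Evaluate the abstract's scaling laws for a length-n binary sequence.

    Assumptions (not from the paper): log is base 2, and every constant
    hidden inside O(.) is taken to be 1.
    """
    log_n = math.log2(n)
    return {
        # Uncoded case: O(n) reads, each longer than 2 log n.
        "uncoded_read_length_threshold": 2 * log_n,
        "uncoded_reads_order": float(n),
        # Coded case: O(n / log n) reads, each longer than log n.
        "coded_read_length_threshold": log_n,
        "coded_reads_order": n / log_n,
    }


if __name__ == "__main__":
    for n in (2**10, 2**20, 2**30):
        print(n, scaling_comparison(n))
```

Under these assumptions, the coded setting halves the read-length threshold and reduces the order of the number of reads by a factor of $\log n$.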

Cite

@article{arxiv.2110.02868,
  title  = {Coded Shotgun Sequencing},
  author = {Aditya Narayan Ravi and Alireza Vahid and Ilan Shomorony},
  journal= {arXiv preprint arXiv:2110.02868},
  year   = {2022}
}

Comments

35 pages, 4 figures, 8 appendices