Related papers: On the Reverse-Complement String-Duplication Syste…
Motivated by applications in in-vivo DNA storage, we study codes for correcting duplications. A reverse-complement duplication of length $k$ is the insertion of the reversed and complemented copy of a substring of length $k$ adjacent to its…
We derive the coding capacity for duplication-correcting codes capable of correcting any number of duplications. We do so both for reverse-complement duplications, as well as palindromic (reverse) duplications. We show that except for…
Motivated by applications in DNA storage, we study a setting in which strings are affected by tandem-duplication errors. In particular, we look at two settings: disjoint tandem-duplication errors, and equal-length tandem-duplication errors.…
A (tandem) duplication of length $ k $ is an insertion of an exact copy of a substring of length $ k $ next to its original position. This and related types of impairments are of relevance in modeling communication in the presence of…
This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous…
We study random string-duplication systems, which we call P\'olya string models. These are motivated by DNA storage in living organisms, and certain random mutation processes that affect their genome. Unlike previous works that study the…
The problem of reconstructing strings from their substring spectra has a long history and in its most simple incarnation asks for determining under which conditions the spectrum uniquely determines the string. We study the problem of coded…
This paper introduces a new family of reconstruction codes which is motivated by applications in DNA data storage and sequencing. In such applications, DNA strands are sequenced by reading some subset of their substrings. While previous…
It is known that the majority of the human genome consists of repeated sequences. Furthermore, it is believed that a significant part of the rest of the genome also originated from repeated sequences and has mutated to its current form. In…
Motivated by applications in polymer-based data storage, we study the problem of reconstructing a string from part of its composition multiset. We give a full description of the structure of the strings that cannot be uniquely reconstructed…
The binary string matching problem consists in finding all the occurrences of a pattern in a text where both strings are built on a binary alphabet. This is an interesting problem in computer science, since binary data are omnipresent in…
Exact string matching has been a fundamental problem in computer science for decades because of many practical applications. Some are related to common procedures, such as searching in files and text editors, or, more recently, to more…
Tandem duplication is the process of inserting a copy of a segment of DNA adjacent to the original position. Motivated by applications that store data in living organisms, Jain et al. (2017) proposed the study of codes that correct tandem…
Due to its high data density and longevity, DNA is considered a promising medium for satisfying ever-increasing data storage needs. However, the diversity of errors that occur in DNA sequences makes efficient error-correction a challenging…
The problem of correcting deletions has received significant attention, partly because of the prevalence of these errors in DNA data storage. In this paper, we study the problem of correcting a consecutive burst of at most $t$ deletions in…
The ability to store data in the DNA of a living organism has applications in a variety of areas including synthetic biology and watermarking of patented genetically-modified organisms. Data stored in this medium is subject to errors…
To increase the information capacity of DNA storage, composite DNA letters were introduced. We propose a novel channel model for composite DNA in which composite sequences are decomposed into ordered standard non-composite sequences. The…
The set of all $ q $-ary strings that do not contain repeated substrings of length $ \leqslant\! 3 $ (i.e., that do not contain substrings of the form $ a a $, $ a b a b $, and $ a b c a b c $) constitutes a code correcting an arbitrary…
The problem of RNA secondary structure design (also called inverse folding) is the following: given a target secondary structure, one aims to create a sequence that folds into, or is compatible with, a given structure. In several practical…
Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more time efficient and are thus widely used in large scale storage…