Related papers: Genomic Compression with Read Alignment at the Dec…
We live in a period where bio-informatics is rapidly expanding, a significant quantity of genomic data has been produced as a result of the advancement of high-throughput genome sequencing technology, raising concerns about the costs…
In a distributed information application an encoder compresses an arbitrary vector while a similar reference vector is available to the decoder as side information. For the Hamming-distance similarity measure, and when guaranteed perfect…
DNA sequencing technology has advanced to a point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while…
Motivation: Despite significant advances in Third-Generation Sequencing (TGS) technologies, Next-Generation Sequencing (NGS) technologies remain dominant in the current sequencing market. This is due to the lower error rates and richer…
Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform…
Technology progress in DNA sequencing boosts the genomic database growth at faster and faster rate. Compression, accompanied with random access capabilities, is the key to maintain those huge amounts of data. In this paper we present an…
We present a Compression Tool, "GenBit Compress", for genetic sequences based on our new proposed "GenBit Compress Algorithm". Our Tool achieves the best compression ratios for Entire Genome (DNA sequences) . Significantly better…
Motivation: Data volumes generated by next-generation sequencing technolo- gies is now a major concern, both for storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as…
We study the problem of compressing a source sequence in the presence of side-information that is related to the source via insertions, deletions and substitutions. We propose a simple algorithm to compress the source sequence when the…
With the development of gene sequencing technology, an explosive growth of gene data has been witnessed. And the storage of gene data has become an important issue. Traditional gene data compression methods rely on general software like…
Motivation The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being…
Modern biological science produces vast amounts of genomic sequence data. This is fuelling the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques coming from information…
This paper considers the problem of distributed source coding for a large network. A major obstacle that poses an existential threat to practical deployment of conventional approaches to distributed coding is the exponential growth of the…
Sequencing a genome to determine an individual's DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data…
Motivated by polymer-based data-storage platforms that use chains of binary synthetic polymers as the recording media and read the content via tandem mass spectrometers, we propose a new family of codes that allows for both unique string…
We present a Deep Image Compression neural network that relies on side information, which is only available to the decoder. We base our algorithm on the assumption that the image available to the encoder and the image available to the…
As the global need for large-scale data storage is rising exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA based storage a potential…
De novo genome assembly is challenging in highly repetitive regions; however, reference-guided assemblers often suffer from bias. We propose a framework for pangenome-guided sequence assembly, which can resolve short-read data in complex…
Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be…
Distributed storage systems with replication are well known for storing large amount of data. A large number of replication is done in order to provide reliability. This makes the system expensive. Various methods have been proposed over…