Related papers: Genomic Compression with Read Alignment at the Dec…

Genetic Sequence compression using Machine Learning and Arithmetic Encoding Decoding Techniques

We live in a period where bio-informatics is rapidly expanding, a significant quantity of genomic data has been produced as a result of the advancement of high-throughput genome sequencing technology, raising concerns about the costs…

Quantitative Methods · Quantitative Biology 2023-03-10 Mehedi Hasan Sarkar, Adnan Ferdous Ashrafi

Efficient Compression of Long Arbitrary Sequences with No Reference at the Encoder

In a distributed information application an encoder compresses an arbitrary vector while a similar reference vector is available to the decoder as side information. For the Hamming-distance similarity measure, and when guaranteed perfect…

Information Theory · Computer Science 2020-09-08 Yuval Cassuto, Jacob Ziv

Reference Based Genome Compression

DNA sequencing technology has advanced to a point where storage is becoming the central bottleneck in the acquisition and mining of more data. Large amounts of data are vital for genomics research, and generic compression tools, while…

Information Theory · Computer Science 2016-11-15 Bobbie Chern, Idoia Ochoa, Alexandros Manolakos, Albert No, Kartik Venkat, Tsachy Weissman

AMGC: Adaptive match-based genomic compression algorithm

Motivation: Despite significant advances in Third-Generation Sequencing (TGS) technologies, Next-Generation Sequencing (NGS) technologies remain dominant in the current sequencing market. This is due to the lower error rates and richer…

Information Theory · Computer Science 2023-04-04 Jia Wang, Yi Niu, Tianyi Xu, Mingming Ma, Dahua Gao, Guangming Shi

Reference Sequence Construction for Relative Compression of Genomes

Relative compression, where a set of similar strings are compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform…

Quantitative Methods · Quantitative Biology 2011-06-21 Shanika Kuruppu, Simon Puglisi, Justin Zobel

Engineering Relative Compression of Genomes

Technology progress in DNA sequencing boosts the genomic database growth at faster and faster rate. Compression, accompanied with random access capabilities, is the key to maintain those huge amounts of data. In this paper we present an…

Computational Engineering, Finance, and Science · Computer Science 2011-03-14 Szymon Grabowski, Sebastian Deorowicz

Genbit Compress Tool(GBC): A Java-Based Tool to Compress DNA Sequences and Compute Compression Ratio(bits/base) of Genomes

We present a Compression Tool, "GenBit Compress", for genetic sequences based on our new proposed "GenBit Compress Algorithm". Our Tool achieves the best compression ratios for Entire Genome (DNA sequences). Significantly better…

Mathematical Software · Computer Science 2010-07-15 P. Raja Rajeswari, Allam Apparo, V. K. Kumar

Compression of high throughput sequencing data with probabilistic de Bruijn graph

Motivation: Data volumes generated by next-generation sequencing technologies is now a major concern, both for storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as…

Data Structures and Algorithms · Computer Science 2014-12-19 Gaëtan Benoit, Claire Lemaitre, Dominique Lavenier, Guillaume Rizk

A Compression Algorithm Using Mis-aligned Side-information

We study the problem of compressing a source sequence in the presence of side-information that is related to the source via insertions, deletions and substitutions. We propose a simple algorithm to compress the source sequence when the…

Information Theory · Computer Science 2016-11-15 Nan Ma, Kannan Ramchandran, David Tse

GeneFormer: Learned Gene Compression using Transformer-based Context Modeling

With the development of gene sequencing technology, an explosive growth of gene data has been witnessed. And the storage of gene data has become an important issue. Traditional gene data compression methods rely on general software like…

Machine Learning · Computer Science 2023-02-01 Zhanbei Cui, Yu Liao, Tongda Xu, Yan Wang

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Motivation The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being…

Data Structures and Algorithms · Computer Science 2015-03-20 Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, Giovanna Rosone

DNA Lossless Differential Compression Algorithm based on Similarity of Genomic Sequence Database

Modern biological science produces vast amounts of genomic sequence data. This is fuelling the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques coming from information…

Data Structures and Algorithms · Computer Science 2011-09-05 Heba Afify, Muhammad Islam, Manal Abdel Wahed

On Large Scale Distributed Compression and Dispersive Information Routing for Networks

This paper considers the problem of distributed source coding for a large network. A major obstacle that poses an existential threat to practical deployment of conventional approaches to distributed coding is the exponential growth of the…

Information Theory · Computer Science 2013-01-08 Kumar Viswanatha, Sharadh Ramaswamy, Ankur Saxena, Emrah Akyol, Kenneth Rose

Learning Genomic Structure from $k$-mers

Sequencing a genome to determine an individual's DNA produces an enormous number of short nucleotide subsequences known as reads, which must be reassembled to reconstruct the full genome. We present a method for analyzing this type of data…

Machine Learning · Computer Science 2025-05-23 Filip Thor, Carl Nettelblad

Coding for Polymer-Based Data Storage

Motivated by polymer-based data-storage platforms that use chains of binary synthetic polymers as the recording media and read the content via tandem mass spectrometers, we propose a new family of codes that allows for both unique string…

Information Theory · Computer Science 2021-06-29 Srilakshmi Pattabiraman, Ryan Gabrys, Olgica Milenkovic

Deep Image Compression using Decoder Side Information

We present a Deep Image Compression neural network that relies on side information, which is only available to the decoder. We base our algorithm on the assumption that the image available to the encoder and the image available to the…

Computer Vision and Pattern Recognition · Computer Science 2020-07-30 Sharon Ayzik, Shai Avidan

Single-Read Reconstruction for DNA Data Storage Using Transformers

As the global need for large-scale data storage is rising exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA based storage a potential…

Emerging Technologies · Computer Science 2021-10-12 Yotam Nahum, Eyar Ben-Tolila, Leon Anavy

Pangenome-guided sequence assembly via binary optimisation

De novo genome assembly is challenging in highly repetitive regions; however, reference-guided assemblers often suffer from bias. We propose a framework for pangenome-guided sequence assembly, which can resolve short-read data in complex…

Quantum Physics · Physics 2026-02-11 Josh Cudby, James Bonfield, Chenxi Zhou, Richard Durbin, Sergii Strelchuk

Disk-based genome sequencing data compression

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be…

Data Structures and Algorithms · Computer Science 2014-09-19 Szymon Grabowski, Sebastian Deorowicz, Łukasz Roguski

Quorum Sensing for Regenerating Codes in Distributed Storage

Distributed storage systems with replication are well known for storing large amount of data. A large number of replication is done in order to provide reliability. This makes the system expensive. Various methods have been proposed over…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-01 Mit Sheth, Krishna Gopal Benerjee, Manish K. Gupta