Related papers: Reference Based Genome Compression

Engineering Relative Compression of Genomes

Technological progress in DNA sequencing boosts genomic database growth at an ever faster rate. Compression, accompanied by random access capabilities, is the key to maintaining those huge amounts of data. In this paper we present an…

Computational Engineering, Finance, and Science · Computer Science · 2011-03-14 · Szymon Grabowski, Sebastian Deorowicz

Genome Compression Against a Reference

Being able to store and transmit human genome sequences is an important part of genomic research and industrial applications. The complete human genome has 3.1 billion base pairs (haploid), and storing the entire genome naively takes about…

Genomics · Quantitative Biology · 2020-10-07 · Anirduddha Laud, Gaurav Menghani, Madhava Keralapura

Reference Sequence Construction for Relative Compression of Genomes

Relative compression, where a set of similar strings is compressed with respect to a reference string, is a very effective method of compressing DNA datasets containing multiple similar sequences. Relative compression is fast to perform…

Quantitative Methods · Quantitative Biology · 2011-06-21 · Shanika Kuruppu, Simon Puglisi, Justin Zobel

Genetic Sequence compression using Machine Learning and Arithmetic Encoding Decoding Techniques

We live in a period where bioinformatics is rapidly expanding. A significant quantity of genomic data has been produced as a result of the advancement of high-throughput genome sequencing technology, raising concerns about the costs…

Quantitative Methods · Quantitative Biology · 2023-03-10 · Mehedi Hasan Sarkar, Adnan Ferdous Ashrafi

GDC 2: Compression of large collections of genomes

The falling price of high-throughput genome sequencing is changing the landscape of modern genomics. A number of large-scale projects aimed at sequencing many human genomes are in progress. Genome sequencing also becomes an important aid…

Data Structures and Algorithms · Computer Science · 2017-03-03 · Sebastian Deorowicz, Agnieszka Danek, Marcin Niemiec

Genbit Compress Tool(GBC): A Java-Based Tool to Compress DNA Sequences and Compute Compression Ratio(bits/base) of Genomes

We present a compression tool, "GenBit Compress", for genetic sequences based on our newly proposed "GenBit Compress Algorithm". Our tool achieves the best compression ratios for entire genomes (DNA sequences). Significantly better…

Mathematical Software · Computer Science · 2010-07-15 · P. Raja Rajeswari, Allam Apparo, V. K. Kumar

FastqZip: An Improved Reference-Based Genome Sequence Lossy Compression Framework

Storing and archiving data produced by next-generation sequencing (NGS) is a huge burden for research institutions. Reference-based compression algorithms are effective in dealing with these data. Our work focuses on compressing FASTQ…

Information Theory · Computer Science · 2024-04-04 · Yuanjian Liu, Huihao Luo, Zhijun Han, Yao Hu, Yehui Yang, Kyle Chard, Sheng Di, Ian Foster, Jiesheng Wu

Disk-based genome sequencing data compression

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be…

Data Structures and Algorithms · Computer Science · 2014-09-19 · Szymon Grabowski, Sebastian Deorowicz, Łukasz Roguski

Genomic Compression with Read Alignment at the Decoder

We propose a new compression scheme for genomic data given as sequence fragments called reads. The scheme uses a reference genome at the decoder side only, freeing the encoder from the burdens of storing references and performing…

Information Theory · Computer Science · 2023-02-10 · Yotam Gershon, Yuval Cassuto

Analysis of Compression Techniques for DNA Sequence Data

Biological data mainly comprise deoxyribonucleic acid (DNA) and protein sequences. These biomolecules are present in all human cells. Due to its self-replicating property, DNA is a key constituent of genetic…

Other Quantitative Biology · Quantitative Biology · 2020-06-04 · Shakeela Bibi, Javed Iqbal, Adnan Iftekhar, Mir Hassan

Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform

Motivation: The Burrows-Wheeler transform (BWT) is the foundation of many algorithms for compression and indexing of text data, but the cost of computing the BWT of very large string collections has prevented these techniques from being…

Data Structures and Algorithms · Computer Science · 2015-03-20 · Anthony J. Cox, Markus J. Bauer, Tobias Jakobi, Giovanna Rosone

A Fixed-Length Coding Algorithm for DNA Sequence Compression

While achieving a compression ratio of 2.0 bits/base, the new algorithm codes non-N bases in fixed length. It dramatically reduces coding and decoding time compared with previous DNA compression algorithms and some universal compression…

Information Theory · Computer Science · 2007-07-16 · Jie Liu, Sheng Bao, Zhiqiang Jing, Shi Chen

Quantum gate algorithm for reference-guided DNA sequence alignment

Reference-guided DNA sequencing and alignment is an important process in computational molecular biology. The amount of DNA data grows very fast, and many new genomes are waiting to be sequenced while millions of private genomes need to be…

Biomolecules · Quantitative Biology · 2023-09-19 · G. D. Varsamis, I. G. Karafyllidis, K. M. Gilkes, U. Arranz, R. Martin-Cuevas, G. Calleja, P. Dimitrakis, P. Kolovos, R. Sandaltzopoulos, H. C. Jessen, J. Wong

DNA Lossless Differential Compression Algorithm based on Similarity of Genomic Sequence Database

Modern biological science produces vast amounts of genomic sequence data. This is fuelling the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques coming from information…

Data Structures and Algorithms · Computer Science · 2011-09-05 · Heba Afify, Muhammad Islam, Manal Abdel Wahed

A Compressed Self-Index for Genomic Databases

Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals' genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv…

Data Structures and Algorithms · Computer Science · 2011-11-08 · Travis Gagie, Juha Kärkkäinen, Yakov Nekrich, Simon J. Puglisi

A grammar compressor for collections of reads with applications to the construction of the BWT

We describe a grammar for DNA sequencing reads from which we can compute the BWT directly. Our motivation is to perform in succinct space genomic analyses that require complex string queries not yet supported by repetition-based…

Data Structures and Algorithms · Computer Science · 2020-11-17 · Diego Díaz-Domínguez, Gonzalo Navarro

An Efficient Biological Sequence Compression Technique Using LUT And Repeat In The Sequence

Data compression plays an important role in dealing with high volumes of DNA sequences in the field of bioinformatics. Moreover, data compression techniques directly affect the alignment of DNA sequences, so the time needed to decompress a…

Computational Engineering, Finance, and Science · Computer Science · 2012-11-13 · Subhankar Roy, Sunirmal Khatua, Sudipta Roy, Samir K. Bandyopadhyay

GeneFormer: Learned Gene Compression using Transformer-based Context Modeling

With the development of gene sequencing technology, gene data have grown explosively, and their storage has become an important issue. Traditional gene data compression methods rely on general software like…

Machine Learning · Computer Science · 2023-02-01 · Zhanbei Cui, Yu Liao, Tongda Xu, Yan Wang

A DNA Sequence Compression Algorithm Based on LUT and LZ77

This article introduces a new DNA sequence compression algorithm based on a LUT and the LZ77 algorithm. Combining a LUT-based pre-coding routine with an LZ77 compression routine, this algorithm can approach a compression ratio of 1.9 bits…

Information Theory · Computer Science · 2007-07-16 · Sheng Bao, Shi Chen, Zhiqiang Jing, Ran Ren

AMGC: Adaptive match-based genomic compression algorithm

Motivation: Despite significant advances in Third-Generation Sequencing (TGS) technologies, Next-Generation Sequencing (NGS) technologies remain dominant in the current sequencing market. This is due to the lower error rates and richer…

Information Theory · Computer Science · 2023-04-04 · Jia Wang, Yi Niu, Tianyi Xu, Mingming Ma, Dahua Gao, Guangming Shi