Related papers: Data compression and genomes: a two dimensional li…
Much evolutionary information is stored in the fluctuations of protein length distributions. The genome size and non-coding DNA content can be calculated based only on the protein length distributions. So there is intrinsic relationship…
We live in a period where bio-informatics is rapidly expanding, a significant quantity of genomic data has been produced as a result of the advancement of high-throughput genome sequencing technology, raising concerns about the costs…
Biological data mainly comprises of Deoxyribonucleic acid (DNA) and protein sequences. These are the biomolecules which are present in all cells of human beings. Due to the self-replicating property of DNA, it is a key constitute of genetic…
Complexity metrics and machine learning (ML) models have been utilized to analyze the lengths of segmental genomic entities like: exons, introns, intergenic and repeat/unique DNA sequences, in each of the 22 human chromosomes. The purpose…
By using the Jensen-Shannon divergence, genomic DNA can be divided into compositionally distinct domains through a standard recursive segmentation procedure. Each domain, while significantly different from its neighbours, may however share…
Non protein coding regions of the human genome contain many complex patterns which regulate the cellular activity. Studying the human genome is limited by the lack of understanding of its features and their complex interactions. However,…
We introduce a novel method to analyse complete genomes and recognise some distinctive features by means of an adaptive compression algorithm, which is not DNA-oriented. We study the Information Content as a function of the number of…
Being able to store and transmit human genome sequences is an important part in genomic research and industrial applications. The complete human genome has 3.1 billion base pairs (haploid), and storing the entire genome naively takes about…
We introduce a method to estimate the complexity function of symbolic dynamical systems from a finite sequence of symbols. We test such complexity estimator on several symbolic dynamical systems whose complexity functions are known exactly.…
We introduce a complexity measure for symbolic sequences. Starting from a segmentation procedure of the sequence, we define its complexity as the entropy of the distribution of lengths of the domains of relatively uniform composition in…
The fall of prices of the high-throughput genome sequencing changes the landscape of modern genomics. A number of large scale projects aimed at sequencing many human genomes are in progress. Genome sequencing also becomes an important aid…
The classification of DNA sequences is a key research area in bioinformatics as it enables researchers to conduct genomic analysis and detect possible diseases. In this paper, three state-of-the-art algorithms, namely Convolutional Neural…
Modern biological science produces vast amounts of genomic sequence data. This is fuelling the need for efficient algorithms for sequence compression and analysis. Data compression and the associated techniques coming from information…
The unconstrained genomic DNA of bacteria forms a coil, which volume exceeds 1000 times the volume of the cell. Since prokaryotes lack a membrane-bound nucleus, in sharp contrast with eukaryotes, the DNA may consequently be expected to…
Genome rearrangement has been an active area of research in computational comparative genomics for the last three decades. While initially mostly an interesting algorithmic endeavor, now the practical application by applying rearrangement…
Capturing the physical organisation and dynamics of genomic regions is one of the major open challenges in biology. The kinetoplast DNA (kDNA) is a topologically complex genome, made by thousands of DNA (mini and maxi) circles interlinked…
The problem of differentiating the informational content of coding (exons) and non-coding (introns) regions of a DNA sequence is one of the central problems of genomics. The introns are estimated to be nearly 95% of the DNA and since they…
Four chapters of the synthesis represent four major areas of my research interests: 1) data analysis in molecular biology, 2) mathematical modeling of biological networks, 3) genome evolution, and 4) cancer systems biology. The first…
Advances in DNA sequencing technology will soon result in databases of thousands of genomes. Within a species, individuals' genomes are almost exact copies of each other; e.g., any two human genomes are 99.9% the same. Relative Lempel-Ziv…
While achieving a compression ratio of 2.0 bits/base, the new algorithm codes non-N bases in fixed length. It dramatically reduces the time of coding and decoding than previous DNA compression algorithms and some universal compression…