Related papers: A new distance between DNA sequences
When the process underlying DNA substitutions varies across evolutionary history, the standard Markov models underlying standard phylogenetic methods are mathematically inconsistent. The most prominent example is the general time reversible…
Modelling the substitution of nucleotides along a phylogenetic tree is usually done by a hidden Markov process. This allows to define a distribution of characters at the leaves of the trees and one might be able to obtain polynomial…
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new ``normalized information distance'', based on the noncomputable notion of…
The new wave of successful generative models in machine learning has increased the interest in deep learning driven de novo drug design. However, assessing the performance of such generative models is notoriously difficult. Metrics that are…
Under a markovian evolutionary process, the expected number of substitutions per site (also called branch length) that have occurred when a sequence has evolved from another according to a transition matrix $P$ can be approximated by…
Surrogate models are a well established approach to reduce the number of expensive function evaluations in continuous optimization. In the context of genetic programming, surrogate modeling still poses a challenge, due to the complex…
Distances between sequences based on their $k$-mer frequency counts can be used to reconstruct phylogenies without first computing a sequence alignment. Past work has shown that effective use of k-mer methods depends on 1) model-based…
We consider models of nucleotidic substitution processes where the rate of substitution at a given site depends on the state of its neighbours. For a wide class of such nonreversible models, we show how to compute consistent, mathematically…
Phylogenetic tree reconstruction is traditionally based on multiple sequence alignments (MSAs) and heavily depends on the validity of this information bottleneck. With increasing sequence divergence, the quality of MSAs decays quickly.…
A mathematical algorithm to describe DNA or RNA sequences of $N$ nucleotides by a string of $2N$ integers numbers is presented in the framework of the so called crystal basis model of the genetic code. The description allows to define a not…
Phylogenetic networks can represent evolutionary events that cannot be described by phylogenetic trees. These networks are able to incorporate reticulate evolutionary events such as hybridization, introgression, and lateral gene transfer.…
Models like support vector machines or Gaussian process regression often require positive semi-definite kernels. These kernels may be based on distance functions. While definiteness is proven for common distances and kernels, a proof for a…
The edit distance under the DCJ model can be computed in linear time for genomes with equal content or with Indels. But it becomes NP-Hard in the presence of duplications, a problem largely unsolved especially when Indels are considered. In…
The log-det distance between two aligned DNA sequences was introduced as a tool for statistically consistent inference of a gene tree under simple non-mixture models of sequence evolution. Here we prove that the log-det distance, coupled…
Given a set of sequences, the distance between pairs of them helps us to find their similarity and derive structural relationship amongst them. For genomic sequences such measures make it possible to construct the evolution tree of…
This paper introduces the Gene Mover's Distance, a measure of similarity between a pair of cells based on their gene expression profiles obtained via single-cell RNA sequencing. The underlying idea of the proposed distance is to interpret…
Generative models are invaluable in many fields of science because of their ability to capture high-dimensional and complicated distributions, such as photo-realistic images, protein structures, and connectomes. How do we evaluate the…
Distances between probability distributions are a key component of many statistical machine learning tasks, from two-sample testing to generative modeling, among others. We introduce a novel distance between measures that compares them…
Tools that effectively analyze and compare sequences are of great importance in various areas of applied computational research, especially in the framework of molecular biology. In the present paper, we introduce simple geometric criteria…
Generative models have achieved remarkable success across a range of applications, yet their evaluation still lacks principled uncertainty quantification. In this paper, we develop a method for comparing how close different generative…