Related papers: Phylogenetic distances for neighbour dependent sub…
The presence of neighbor dependencies generated a specific pattern of dinucleotide frequencies in all organisms. Especially, the CpG-methylation-deamination process is the predominant substitution process in vertebrates and needs to be…
We prove that a wide class of models of Markov neighbor-dependent substitution processes on the integer line is solvable. This class contains some models of nucleotide substitutions recently introduced and studied empirically by molecular…
We propose a new distance metric for DNA sequences, which can be defined on any evolutionary Markov model with infinitesimal generator matrix Q. That is, the new metric can be defined under existing models such as the Jukes-Cantor model,…
We introduce a model of DNA sequence evolution which can account for biases in mutation rates that depend on the identity of the neighboring bases. An analytic solution for this class of non-equilibrium models is developed by adopting…
In this paper, we apply conformal prediction to time series data. Conformal prediction is a method that produces predictive regions given a confidence level. The output regions are always valid under the exchangeability assumption. However,…
This paper addresses the estimation of locally stationary long-range dependent processes, a methodology that allows the statistical analysis of time series data exhibiting both nonstationarity and strong dependency. A time-varying…
Distances between sequences based on their $k$-mer frequency counts can be used to reconstruct phylogenies without first computing a sequence alignment. Past work has shown that effective use of $k$-mer methods depends on 1) model-based…
Accurate estimation of evolutionary distances between taxa is important for many phylogenetic reconstruction methods. In the case of bacteria, distances can be estimated using a range of different evolutionary models, from single nucleotide…
This article proposes a novel approach to statistical alignment of nucleotide sequences by introducing a context dependent structure on the substitution process in the underlying evolutionary model. We propose to estimate alignments and…
We consider the problem of distance estimation under the TKF91 model of sequence evolution by insertions, deletions and substitutions on a phylogeny. In an asymptotic regime where the expected sequence lengths tend to infinity, we show that…
Modelling the substitution of nucleotides along a phylogenetic tree is usually done by a hidden Markov process. This allows one to define a distribution of characters at the leaves of the trees and one might be able to obtain polynomial…
This paper proposes an extension to conventional regression Neural Networks (NNs) for replacing the point predictions they produce with prediction intervals that satisfy a required level of confidence. Our approach follows a novel machine…
Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold et al. (2009) showed how the average number of substitutions between two DNA…
We define two minimum distance estimators for dependent data by minimizing some approximated Maximum Mean Discrepancy distances between the true empirical distribution of observations and their assumed (parametric) model distribution. When…
Let $(X_i)_{i=1,...,n}$ be a possibly nonstationary sequence such that $\mathscr{L}(X_i)=P_n$ if $i\leq n\theta$ and $\mathscr{L}(X_i)=Q_n$ if $i>n\theta$, where $0<\theta <1$ is the location of the change-point to be estimated. We…
We study the problem of estimating the mutation rate between two sequences from noisy sequencing reads. Existing alignment-free methods typically assume direct access to the full sequences. We extend these methods to the sequencing…
Quantifying uncertainty in automatically generated text is important for letting humans check potential hallucinations and making systems more reliable. Conformal prediction is an attractive framework to provide predictions imbued with…
Inferring the phylogenetic relationships among a sample of organisms is a fundamental problem in modern biology. While distance-based hierarchical clustering algorithms achieved early success on this task, these have been supplanted by…
Pathogen genome data offers valuable structure for spatial models, but its utility is limited by incomplete sequencing coverage. We propose a probabilistic framework for inferring genetic distances between unsequenced cases and known…
We extend in two directions our previous results about the sampling and the empirical measures of immortal branching Markov processes. Direct applications to molecular biology are rigorous estimates of the mutation rates of polymerase chain…