Alignment-free sequence comparison using absent words

Panagiotis Charalampopoulos; Maxime Crochemore; Gabriele Fici; Robert Mercas; Solon P. Pissis


Data Structures and Algorithms; Formal Languages and Automata Theory | 2018-06-08 | v1


Abstract

Sequence comparison is a prerequisite to virtually all comparative genomic analyses. It is often realised by sequence alignment techniques, which are computationally expensive. This has led to increased research into alignment-free techniques, which are based on measures referring to the composition of sequences in terms of their constituent patterns. These measures, such as $q$-gram distance, are usually computed in time linear with respect to the length of the sequences. In this paper, we focus on the complementary idea: how two sequences can be efficiently compared based on information that does not occur in the sequences. A word is an absent word of some sequence if it does not occur in the sequence. An absent word is minimal if all its proper factors occur in the sequence. Here we present the first linear-time and linear-space algorithm to compare two sequences by considering all their minimal absent words. In the process, we present results of combinatorial interest, and also extend the proposed techniques to compare circular sequences. We also present an algorithm that, given a word $x$ of length $n$, computes in $\mathcal{O}(n)$ time and space the largest integer for which all factors of $x$ of that length occur in some minimal absent word of $x$. Finally, we show that the known asymptotic upper bound on the number of minimal absent words of a word is tight.
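To make the definitions above concrete, here is a minimal brute-force sketch in Python. It is not the linear-time algorithm announced in the abstract: it enumerates all factors explicitly, so it is only suitable for short words. The function names `factors` and `minimal_absent_words` are hypothetical, the alphabet is passed in explicitly as an assumption, and the symmetric difference at the end is just one simple way to contrast two sets of minimal absent words, not necessarily the measure used in the paper.

```python
def factors(x):
    """Return the set of all non-empty factors (substrings) of x."""
    return {x[i:j] for i in range(len(x)) for j in range(i + 1, len(x) + 1)}


def minimal_absent_words(x, alphabet):
    """Brute-force set of minimal absent words of x over the given alphabet.

    A word w is a minimal absent word of x if w does not occur in x
    but every proper factor of w does. For |w| >= 2 it suffices to check
    the longest proper prefix and the longest proper suffix, because
    every factor of a word occurring in x also occurs in x.
    """
    facts = factors(x)
    # Length-1 case: a letter that never occurs in x.
    maws = {a for a in alphabet if a not in facts}
    # Length >= 2: extend each factor u by one letter b on the right.
    for u in facts:
        for b in alphabet:
            w = u + b
            # w[:-1] == u occurs by construction; w[1:] must occur too.
            if w not in facts and w[1:] in facts:
                maws.add(w)
    return maws


if __name__ == "__main__":
    mx = minimal_absent_words("abaab", "ab")
    my = minimal_absent_words("abab", "ab")
    print(sorted(mx))       # ['aaa', 'aaba', 'bab', 'bb']
    print(sorted(my))       # ['aa', 'baba', 'bb']
    # Words that are minimal absent for exactly one of the two inputs.
    print(sorted(mx ^ my))  # ['aa', 'aaa', 'aaba', 'bab', 'baba']
```

In this sketch, two words with similar composition share many minimal absent words, so the symmetric difference of the two sets shrinks as the words become more alike.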

Keywords

string algorithms; similarity search; sequence design

Cite

@article{arxiv.1806.02718,
  title  = {Alignment-free sequence comparison using absent words},
  author = {Panagiotis Charalampopoulos and Maxime Crochemore and Gabriele Fici and Robert Mercas and Solon P. Pissis},
  journal= {arXiv preprint arXiv:1806.02718},
  year   = {2018}
}

Comments

Extended version of "Linear-Time Sequence Comparison Using Minimal Absent Words & Applications", Proc. LATIN 2016, arXiv:1506.04917
