Related papers: Approximate word matches between two random sequen…
Word matches are often used in sequence comparison methods, either as a measure of sequence similarity or in the first search steps of algorithms such as BLAST or BLAT. The D2 statistic is the number of matches of words of k letters between…
This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of the use of this statistic over alignment-based methods are…
Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all…
The classical pattern matching asks for locating all occurrences of one string, called the pattern, in another, called the text, where a string is simply a sequence of characters. Due to the potential practical applications, it is desirable…
We study here the so called subsequence pattern matching also known as hidden pattern matching in which one searches for a given pattern $w$ of length $m$ as a subsequence in a random text of length $n$. The quantity of interest is the…
In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions.…
In this work, we consider the problem of pattern matching under the dynamic time warping (DTW) distance motivated by potential applications in the analysis of biological data produced by the third generation sequencing. To measure the DTW…
Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold {\em et al.} (2009) showed how the average number of substitutions between two DNA…
A sequence of geometric random variables of length $n$ is a sequence of $n$ independent and identically distributed geometric random variables ($\Gamma_1, \Gamma_2, \dots, \Gamma_n$) where $\mathbb{P}(\Gamma_j=i)=pq^{i-1}$ for…
In this work we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance…
The approximate string matching is a fundamental and recurrent problem that arises in most computer science fields. This problem can be defined as follows: Let $D=\{x_1,x_2,\ldots x_d\}$ be a set of $d$ words defined on an alphabet…
The problem addressed concerns the determination of the average number of successive attempts of guessing a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations…
The purpose of this paper is to study a statistic that is used to compare the similarity between two strings, which is first introduced by Michael Steele in 1982. It was proposed as an alternative to the length of the longest common…
A correlation is a binary vector that encodes all possible positions of overlaps of two words, where an overlap for an ordered pair of words (u,v) occurs if a suffix of word u matches a prefix of word v. As multiple pairs can have the same…
We investigate the variance of the length of the longest common subsequences of two independent random words of size $n$, where the letters of one word are i.i.d. uniformly drawn from $\{\alpha_1, \alpha_2, \cdots, \alpha_m\}$, while the…
Dependency distance minimization (DDm) is a word order principle favouring the placement of syntactically related words close to each other in sentences. Massive evidence of the principle has been reported for more than a decade with the…
In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations ``eat a peach'' and ``eat…
In the Dictionary-based String Matching (DSM) problem, a retrieval system has access to a source sequence and stores the position of a certain number of strings in a posting table. When a user inquires the position of a string, the…
Described are two algorithms to find long approximate palindromes in a string, for example a DNA sequence. A simple algorithm requires O(n)-space and almost always runs in $O(k.n)$-time where n is the length of the string and k is the…
Algorithmic statistics considers the following problem: given a binary string $x$ (e.g., some experimental data), find a "good" explanation of this data. It uses algorithmic information theory to define formally what is a good explanation.…