Related papers: Approximate word matches between two random sequen…

Characterising the D2 statistic: word matches in biological sequences

Word matches are often used in sequence comparison methods, either as a measure of sequence similarity or in the first search steps of algorithms such as BLAST or BLAT. The D2 statistic is the number of matches of words of k letters between…

Quantitative Methods · Quantitative Biology 2009-09-09 Sylvain Foret, Susan R. Wilson, Conrad J. Burden

Empirical distribution of k-word matches in biological sequences

This study focuses on an alignment-free sequence comparison method: the number of words of length k shared between two sequences, also known as the D_2 statistic. The advantages of the use of this statistic over alignment-based methods are…

Quantitative Methods · Quantitative Biology 2009-09-08 Sylvain Foret, Susan R. Wilson, Conrad J. Burden

Spectral Analysis of Word Statistics

Given a random text over a finite alphabet, we study the frequencies at which fixed-length words occur as subsequences. As the data size grows, the joint distribution of word counts exhibits a rich asymptotic structure. We investigate all…

Probability · Mathematics 2026-05-06 Chaim Even-Zohar, Tsviqa Lakrec, Ran J. Tessler

Faster two-dimensional pattern matching with $k$ mismatches

The classical pattern matching asks for locating all occurrences of one string, called the pattern, in another, called the text, where a string is simply a sequence of characters. Due to the potential practical applications, it is desirable…

Data Structures and Algorithms · Computer Science 2024-10-30 Jonas Ellert, Paweł Gawrychowski, Adam Górkiewicz, Tatiana Starikovskaya

Hidden Words Statistics for Large Patterns

We study here the so-called subsequence pattern matching, also known as hidden pattern matching, in which one searches for a given pattern $w$ of length $m$ as a subsequence in a random text of length $n$. The quantity of interest is the…

Probability · Mathematics 2020-03-24 Svante Janson, Wojciech Szpankowski

In this work we explore the dissimilarity between symmetric word pairs, by comparing the inter-word distance distribution of a word to that of its reversed complement. We propose a new measure of dissimilarity between such distributions.…

Applications · Statistics 2021-01-13 Ana Helena Tavares, Jakob Raymaekers, Peter J. Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo

Pattern matching under DTW distance

In this work, we consider the problem of pattern matching under the dynamic time warping (DTW) distance motivated by potential applications in the analysis of biological data produced by the third generation sequencing. To measure the DTW…

Data Structures and Algorithms · Computer Science 2022-09-01 Garance Gourdel, Anne Driemel, Pierre Peterlongo, Tatiana Starikovskaya

Estimating phylogenetic distances between genomic sequences based on the length distribution of k-mismatch common substrings

Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between two input sequences. Haubold et al. (2009) showed how the average number of substitutions between two DNA…

Populations and Evolution · Quantitative Biology 2017-09-06 Burkhard Morgenstern, Svenja Schöbel, Chris-André Leimeister

The number of distinct adjacent pairs in geometrically distributed words

A sequence of geometric random variables of length $n$ is a sequence of $n$ independent and identically distributed geometric random variables ($\Gamma_1, \Gamma_2, \dots, \Gamma_n$) where $\mathbb{P}(\Gamma_j=i)=pq^{i-1}$ for…

Combinatorics · Mathematics 2023-06-22 Margaret Archibald, Aubrey Blecher, Charlotte Brennan, Arnold Knopfmacher, Stephan Wagner, Mark Ward

Comparing reverse complementary genomic words based on their distance distributions and frequencies

In this work we study reverse complementary genomic word pairs in the human DNA, by comparing both the distance distribution and the frequency of a word to those of its reverse complement. Several measures of dissimilarity between distance…

Applications · Statistics 2021-01-13 Ana Helena Tavares, Jakob Raymaekers, Peter Rousseeuw, Raquel M. Silva, Carlos A. C. Bastos, Armando Pinho, Paula Brito, Vera Afreixo

Approximate String Matching: Theory and Applications (La Recherche Approchée de Motifs : Théorie et Applications)

The approximate string matching is a fundamental and recurrent problem that arises in most computer science fields. This problem can be defined as follows: Let $D=\{x_1,x_2,\ldots x_d\}$ be a set of $d$ words defined on an alphabet…

Data Structures and Algorithms · Computer Science 2017-01-31 Ibrahim Chegrane

Exact Probability Distribution versus Entropy

The problem addressed concerns the determination of the average number of successive attempts of guessing a word of a certain length consisting of letters with given probabilities of occurrence. Both first- and second-order approximations…

Information Theory · Computer Science 2015-06-19 Kerstin Andersson

On an alternative sequence comparison statistic of Steele

The purpose of this paper is to study a statistic used to compare the similarity between two strings, which was first introduced by Michael Steele in 1982. It was proposed as an alternative to the length of the longest common…

Probability · Mathematics 2023-06-22 Ümit Işlak, Alperen Y. Özdemir

Counting overlapping pairs of words

A correlation is a binary vector that encodes all possible positions of overlaps of two words, where an overlap for an ordered pair of words (u,v) occurs if a suffix of word u matches a prefix of word v. As multiple pairs can have the same…

Discrete Mathematics · Computer Science 2025-06-03 Eric Rivals, Pengfei Wang

On the Variance of the Length of the Longest Common Subsequences in Random Words With an Omitted Letter

We investigate the variance of the length of the longest common subsequences of two independent random words of size $n$, where the letters of one word are i.i.d. uniformly drawn from $\{\alpha_1, \alpha_2, \cdots, \alpha_m\}$, while the…

Probability · Mathematics 2018-12-27 Christian Houdré, Qingqing Liu

Anti dependency distance minimization in short sequences. A graph theoretic approach

Dependency distance minimization (DDm) is a word order principle favouring the placement of syntactically related words close to each other in sentences. Massive evidence of the principle has been reported for more than a decade with the…

Computation and Language · Computer Science 2021-02-02 Ramon Ferrer-i-Cancho, Carlos Gómez-Rodríguez

In many applications of natural language processing it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations "eat a peach" and "eat…

cmp-lg · Computer Science 2008-02-03 Ido Dagan, Fernando Pereira, Lillian Lee

The Statistical Dictionary-based String Matching Problem

In the Dictionary-based String Matching (DSM) problem, a retrieval system has access to a source sequence and stores the position of a certain number of strings in a posting table. When a user inquires the position of a string, the…

Information Retrieval · Computer Science 2018-11-26 M. Suri, S. Rini

Finding Approximate Palindromes in Strings Quickly and Simply

Described are two algorithms to find long approximate palindromes in a string, for example a DNA sequence. A simple algorithm requires $O(n)$-space and almost always runs in $O(k \cdot n)$-time where $n$ is the length of the string and $k$ is the…

Data Structures and Algorithms · Computer Science 2007-05-23 L. Allison

Algorithmic statistics, prediction and machine learning

Algorithmic statistics considers the following problem: given a binary string $x$ (e.g., some experimental data), find a "good" explanation of this data. It uses algorithmic information theory to define formally what is a good explanation.…

Machine Learning · Computer Science 2015-09-21 Alexey Milovanov