Approximate word matches between two random sequences

Conrad J. Burden; Miriam R. Kantorovitz; Susan R. Wilson

doi:10.1214/07-AAP452

Probability 2009-09-29 v1

Abstract

Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_2$ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For $k<m$, we look at the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance and prove its distribution is asymptotically normal.
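
As an illustration of the statistic described in the abstract, the following is a minimal brute-force sketch, not the authors' implementation. It assumes that every pair of starting positions (one in each sequence) is compared, that words do not wrap around the sequence ends, and that a mismatch means a differing letter at the same offset within the two words. The function name `approx_word_matches` and its parameters are hypothetical. Setting `k = 0` gives the plain $D_2$ count of exact $m$-letter word matches.

```python
def approx_word_matches(a: str, b: str, m: int, k: int) -> int:
    """Count pairs (i, j) such that the m-letter words a[i:i+m] and
    b[j:j+m] differ in at most k positions.

    Assumptions (not taken from the paper): non-periodic boundaries,
    all position pairs counted, brute-force O(len(a) * len(b) * m) time.
    """
    if not 0 <= k < m:
        raise ValueError("k must satisfy 0 <= k < m")
    count = 0
    for i in range(len(a) - m + 1):
        word_a = a[i:i + m]
        for j in range(len(b) - m + 1):
            word_b = b[j:j + m]
            # Number of offsets at which the two words disagree.
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= k:
                count += 1
    return count


# Exact matches only (the D_2 case), then allowing one mismatch.
exact = approx_word_matches("AAB", "AAA", m=2, k=0)
approx = approx_word_matches("AAB", "AAA", m=2, k=1)
```

In this small example the words of the first sequence are `AA` and `AB`, and both words of the second sequence are `AA`, so allowing one mismatch adds the two `AB`/`AA` pairs to the two exact `AA`/`AA` matches.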

Cite

@article{arxiv.0801.3145,
  title  = {Approximate word matches between two random sequences},
  author = {Conrad J. Burden and Miriam R. Kantorovitz and Susan R. Wilson},
  journal= {arXiv preprint arXiv:0801.3145},
  year   = {2009}
}

Comments

Published in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/07-AAP452
