
Approximate word matches between two random sequences

Probability 2009-09-29 v1

Abstract

Given two sequences over a finite alphabet $\mathcal{L}$, the $D_2$ statistic is the number of $m$-letter word matches between the two sequences. This statistic is used in bioinformatics for expressed sequence tag database searches. Here we study a generalization of the $D_2$ statistic in the context of DNA sequences, under the assumption of strand symmetric Bernoulli text. For $k<m$, we look at the count of $m$-letter word matches with up to $k$ mismatches. For this statistic, we compute the expectation, give upper and lower bounds for the variance, and prove that its distribution is asymptotically normal.
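As an illustration of the statistic described in the abstract, the following is a minimal Python sketch, not the authors' implementation. The function names `approx_word_matches` and `expected_matches` are hypothetical. The sketch assumes non-overlapping boundary handling (words do not wrap around the ends of the sequences), which may differ from the convention used in the paper, and it assumes Bernoulli text in the sense of independent, identically distributed letters. Under that assumption the number of mismatches between two random $m$-letter words is binomial with parameters $m$ and $1 - p_2$, where $p_2 = \sum_a p_a^2$ is the per-position match probability.

```python
from math import comb


def approx_word_matches(seq_a, seq_b, m, k):
    """Count pairs (i, j) whose m-letter words differ in at most k positions.

    With k = 0 this reduces to counting exact m-letter word matches.
    Brute-force O(len_a * len_b * m) scan, intended for illustration only.
    """
    count = 0
    for i in range(len(seq_a) - m + 1):
        word_a = seq_a[i:i + m]
        for j in range(len(seq_b) - m + 1):
            word_b = seq_b[j:j + m]
            # Hamming distance between the two m-letter words
            mismatches = sum(x != y for x, y in zip(word_a, word_b))
            if mismatches <= k:
                count += 1
    return count


def expected_matches(len_a, len_b, m, k, letter_probs):
    """Expected count under i.i.d. letters (assumption: no wrap-around).

    letter_probs maps each letter to its probability. A strand symmetric
    choice for DNA would set P(A) = P(T) and P(C) = P(G).
    """
    # Probability that two independent random letters agree
    p2 = sum(p * p for p in letter_probs.values())
    # P(at most k mismatches among m independent positions)
    tail = sum(comb(m, t) * (1 - p2) ** t * p2 ** (m - t) for t in range(k + 1))
    return (len_a - m + 1) * (len_b - m + 1) * tail


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(approx_word_matches("ACGT", "ACGA", 2, 0))
    print(approx_word_matches("ACGT", "ACGA", 2, 1))
    print(expected_matches(4, 4, 2, 1, uniform))
```

The expectation follows from linearity over all word pairs. The variance bounds and the asymptotic normality result stated in the abstract need the dependence between overlapping words to be handled, which this sketch does not attempt.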

Cite

@article{arxiv.0801.3145,
  title  = {Approximate word matches between two random sequences},
  author = {Conrad J. Burden and Miriam R. Kantorovitz and Susan R. Wilson},
  journal= {arXiv preprint arXiv:0801.3145},
  year   = {2009}
}

Comments

Published in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org); available at http://dx.doi.org/10.1214/07-AAP452
