Efficient Index for Weighted Sequences

Data Structures and Algorithms 2016-02-04 v1

Abstract

The problem of finding factors of a text string which are identical or similar to a given pattern string is a central problem in computer science. A generalised version of this problem consists in implementing an index over the text to support efficient on-line pattern queries. We study this problem in the case where the text is weighted: for every position of the text and every letter of the alphabet a probability of occurrence of this letter at this position is given. Sequences of this type, also called position weight matrices, are commonly used to represent imprecise or uncertain data. A weighted sequence may represent many different strings, each with probability of occurrence equal to the product of probabilities of its letters at subsequent positions. Given a probability threshold $1/z$, we say that a pattern string $P$ matches a weighted text at position $i$ if the product of probabilities of the letters of $P$ at positions $i,\ldots,i+|P|-1$ in the text is at least $1/z$. In this article, we present an $O(nz)$-time construction of an $O(nz)$-sized index that can answer pattern matching queries in a weighted text in optimal time, improving upon the state of the art by a factor of $z \log z$. Other applications of this data structure include an $O(nz)$-time construction of the weighted prefix table and an $O(nz)$-time computation of all covers of a weighted sequence, which improve upon the state of the art by the same factor.
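
To make the matching condition concrete, the following is a minimal sketch of a naive check that applies the definition directly. It is not the index described in the abstract and runs in time proportional to the text length times the pattern length. The function name `weighted_occurrences`, the representation of the text as a list of letter-to-probability dictionaries, and the use of 0-based positions are assumptions made for illustration only.

```python
from fractions import Fraction


def weighted_occurrences(text, pattern, z):
    """Return all 0-based positions i at which `pattern` matches `text`.

    `text` is a list with one dict per position, mapping each letter to
    its probability at that position (missing letters have probability 0).
    A match at i means the product of the probabilities of pattern[j] at
    position i + j, over all j, is at least 1/z.
    """
    threshold = Fraction(1, z)
    n, m = len(text), len(pattern)
    positions = []
    for i in range(n - m + 1):
        product = Fraction(1)
        for j, letter in enumerate(pattern):
            product *= text[i + j].get(letter, Fraction(0))
            # Probabilities are at most 1, so the product never grows;
            # once it drops below the threshold this position cannot match.
            if product < threshold:
                break
        else:
            positions.append(i)
    return positions


# Hypothetical weighted text of length 3 over the alphabet {a, b}.
example_text = [
    {"a": Fraction(1, 2), "b": Fraction(1, 2)},
    {"a": Fraction(1)},
    {"a": Fraction(3, 4), "b": Fraction(1, 4)},
]
```

In this example, `weighted_occurrences(example_text, "aa", 2)` reports positions 0 and 1, because the products are 1/2 and 3/4 respectively. Exact fractions are used so that a product equal to the threshold is not lost to floating-point rounding.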

Cite

@article{arxiv.1602.01116,
  title  = {Efficient Index for Weighted Sequences},
  author = {Carl Barton and Tomasz Kociumaka and Solon P. Pissis and Jakub Radoszewski},
  journal= {arXiv preprint arXiv:1602.01116},
  year   = {2016}
}

Comments

14 pages
