Efficient Approximation Algorithms for String Kernel Based Sequence Classification

Data Structures and Algorithms 2017-12-13 v1

Abstract

Sequence classification algorithms, such as SVM, require a definition of a distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between k-mers (length-k subsequences) in the two sequences. Extending this definition, by considering two k-mers to match if their distance is at most m, yields better classification performance. This, however, makes the problem computationally much more complex. Known algorithms to compute this similarity have computational complexity that renders them applicable only for small values of k and m. In this work, we develop novel techniques to efficiently and accurately estimate the pairwise similarity score, which enables us to use much larger values of k and m and get higher predictive accuracy. This opens up a broad avenue of applying this classification approach to audio, images, and text sequences. Our algorithm achieves excellent approximation performance with theoretical guarantees. In the process we solve an open combinatorial problem, which was posed as a major hindrance to the scalability of existing solutions. We give analytical bounds on the quality and runtime of our algorithm and report its empirical performance on real-world biological and music sequence datasets.
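
To make the similarity notion above concrete, the following is a minimal brute-force sketch, not the approximation algorithm developed in this work. It assumes that k-mers are contiguous substrings and that "distance" means Hamming distance; the abstract states neither, and the function names `kmers`, `hamming`, and `mismatch_similarity` are hypothetical, chosen only for illustration. The sketch counts pairs of k-mers, one from each sequence, whose distance is at most m.

```python
from itertools import product


def kmers(seq, k):
    """Return all contiguous length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def mismatch_similarity(x, y, k, m):
    """Count pairs of k-mers (one from x, one from y) that differ in
    at most m positions. Assumption: Hamming distance, brute force."""
    return sum(
        1
        for a, b in product(kmers(x, k), kmers(y, k))
        if hamming(a, b) <= m
    )


if __name__ == "__main__":
    # With m = 0 only exact k-mer matches count; larger m admits more pairs.
    for m in range(3):
        print(m, mismatch_similarity("ACGT", "ACGA", k=2, m=m))
```

This sketch compares every pair of k-mers, so its cost grows with the product of the sequence lengths times k. It shows what is being counted, and is not a practical way to compute the score for large k and m.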

Cite

@article{arxiv.1712.04264,
  title  = {Efficient Approximation Algorithms for String Kernel Based Sequence Classification},
  author = {Muhammad Farhan and Juvaria Tariq and Arif Zaman and Mudassir Shabbir and Imdad Ullah Khan},
  journal= {arXiv preprint arXiv:1712.04264},
  year   = {2017}
}