Related papers: Sampling the suffix array with minimizers
The suffix array is a classic full-text index, combining effectiveness with simplicity. We discuss three approaches aiming to improve its efficiency even more: changing the navigation, changing the data layout, and adding extra data. In short, we show…
The Suffix Array is a classic text index enabling on-line pattern matching queries via simple binary search. The main drawback of the Suffix Array is that it takes linear space in the text's length, even if the text itself is extremely…
We present several results about position heaps, a relatively new alternative to suffix trees and suffix arrays. First, we show that, if we limit the maximum length of patterns to be sought, then we can also limit the height of the heap and…
We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a…
When augmented with the longest common prefix (LCP) array and some other structures, the suffix array can solve many string processing problems in optimal time and space. A compressed representation of the LCP array is also one of the main…
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Sampled string…
Subsequence-based time series classification algorithms provide accurate and interpretable models, but training these models is extremely computation intensive. The asymptotic time complexity of subsequence-based algorithms remains a…
Sparse suffix sorting is the problem of sorting $b=o(n)$ suffixes of a string of length $n$. Efficient sparse suffix sorting algorithms have existed for more than a decade. Despite the multitude of works and their justified claims for…
Packing several characters into one computer word is a simple and natural way to compress the representation of a string and to speed up its processing. Exploiting this idea, we propose an index for a packed string, based on a sparse…
A deterministic BSP algorithm for constructing the suffix array of a given string is presented, based on a technique which we call accelerated sampling. It runs in optimal O(n/p) local computation and communication, and requires a near…
Stemming, or suffix stripping, an important part of modern information retrieval systems, is the task of finding the root word (stem) of a given cluster of words. Existing algorithms targeting this problem have been developed in a haphazard…
We propose two suffix-array-inspired full-text indexes. One, called SA-hash, augments the suffix array with a hash table to speed up pattern searches by significantly narrowing the search interval before the binary search phase. The other,…
The suffix array is an efficient data structure for in-memory pattern search. Suffix arrays can also be used for external-memory pattern search, via two-level structures that use an internal index to identify the correct block of suffix…
It has been shown in the indexing literature that there is an essential difference between prefix/range searches on the one hand, and predecessor/rank searches on the other hand, in that the former provably allows faster query resolution.…
The Minimax distance measure extracts the underlying patterns and manifolds in an unsupervised manner. The existing methods require quadratic memory with respect to the number of objects. In this paper, we investigate efficient sampling…
We present parallel algorithms for exact and approximate pattern matching with suffix arrays, using a CREW-PRAM with $p$ processors. Given a static text of length $n$, we first show how to compute the suffix array interval of a given…
Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data…
Many signal and image processing applications have benefited remarkably from the fact that the underlying signals reside in a low dimensional subspace. One of the main models for such a low dimensionality is the sparsity one. Within this…
We solve the problem of finding interspersed maximal repeats using a suffix array construction. As is well known, all the functionality of suffix trees can be handled by suffix arrays, gaining practicality. Our solution improves the…
Suffixient sets are a novel prefix array (PA) compression technique based on subsampling PA (rather than compressing the entire array like previous techniques used to do): by storing very few entries of PA (in fact, a compressed number of…