Related papers: String Sampling with Bidirectional String Anchors
Minimizers sampling is one of the most widely-used mechanisms for sampling strings. Let $S=S[0]\ldots S[n-1]$ be a string over an alphabet $\Sigma$. In addition, let $w\geq 2$ and $k\geq 1$ be two integers and $\rho=(\Sigma^k,\leq)$ be a…
Minimizers sampling is one of the most widely-used mechanisms for sampling strings [Roberts et al., Bioinformatics 2004]. Let $S=S[1]\ldots S[n]$ be a string over a totally ordered alphabet $\Sigma$. Further let $w\geq 2$ and $k\geq 1$ be…
Minimizers are sampling schemes with numerous applications in computational biology. Assuming a fixed alphabet of size $\sigma$, a minimizer is defined by two integers $k,w\ge2$ and a linear order $\rho$ on strings of length $k$ (also…
Minimizer schemes, or just minimizers, are a very important computational primitive in sampling and sketching biological strings. Assuming a fixed alphabet of size $\sigma$, a minimizer is defined by two integers $k,w\ge2$ and a total order…
Sampling (evenly) the suffixes from the suffix array is an old idea trading the pattern search time for reduced index space. A few years ago Claude et al. showed an alphabet sampling scheme allowing for more efficient pattern searches…
Given a set of strings over a specified alphabet, identifying a median or consensus string that minimizes the total distance to all input strings is a fundamental data aggregation problem. When the Hamming distance is considered as the…
String attractors [STOC 2018] are combinatorial objects recently introduced to unify all known dictionary compression techniques in a single theory. A set $\Gamma\subseteq [1..n]$ is a $k$-attractor for a string $S\in[1..\sigma]^n$ if and…
A well-known fact in the field of lossless text compression is that high-order entropy is a weak model when the input contains long repetitions. Motivated by this, decades of research have generated myriads of so-called dictionary…
Let $S$ be a string of length $n$. In this paper we introduce the notion of \emph{string attractor}: a subset of the string's positions $[1,n]$ such that every distinct substring of $S$ has an occurrence crossing one of the attractor's…
Min-entropy sampling gives a bound on the min-entropy of a randomly chosen subset of a string, given a bound on the min-entropy of the whole string. K\"onig and Renner showed a min-entropy sampling theorem that holds relative to quantum…
Motivated by the imminent growth of massive, highly redundant genomic databases, we study the problem of compressing a string database while simultaneously supporting fast random access, substring extraction and pattern matching to the…
In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to…
We propose novel algorithms for sequence prediction based on ideas from stringology. These algorithms are time and space efficient and satisfy mistake bounds related to particular stringological complexity measures of the sequence. In this…
MinMax sampling is a technique for downsampling a real-valued vector which minimizes the maximum variance over all vector components. This approach is useful for reducing the amount of data that must be sent over a constrained network link…
In experimental design, we are given a large collection of vectors, each with a hidden response value that we assume derives from an underlying linear model, and we wish to pick a small subset of the vectors such that querying the…
Given a string $w$, another string $v$ is said to be a subsequence of $w$ if $v$ can be obtained from $w$ by removing some of its letters; on the other hand, $v$ is called an absent subsequence of $w$ if $v$ is not a subsequence of $w$. The…
Searching for all occurrences of a pattern in a text is a fundamental problem in computer science with applications in many other fields, like natural language processing, information retrieval and computational biology. Sampled string…
The problem of finding a center string that is `close' to every given string arises and has many applications in computational biology and coding theory. This problem has two versions: the Closest String problem and the Closest Substring…
We consider compact representations of collections of similar strings that support random access queries. The collection of strings is given by a rooted tree where edges are labeled by an edit operation (inserting, deleting, or replacing a…
Numbers and numerical vectors account for a large portion of data. However, recently the amount of string data generated has increased dramatically. Consequently, classifying string data is a common problem in many fields. The most widely…