Related papers: Compressed String Dictionaries
We introduce a new family of compressed data structures to efficiently store and query large string dictionaries in main memory. Our main technique is a combination of hierarchical Front-coding with ideas from longest-common-prefix…
Most of the fastest-growing string collections today are repetitive, that is, most of the constituent documents are similar to many others. As these collections keep growing, a key approach to handling them is to exploit their…
Given a string $S$ of length $n$, the classic string indexing problem is to preprocess $S$ into a compact data structure that supports efficient subsequent pattern queries. In this paper we consider the basic variant where the pattern is…
We show how to compress string dictionaries using the Lempel-Ziv (LZ78) data compression algorithm. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often…
Tries are popular data structures for storing a set of strings, where common prefixes are represented by common root-to-node paths. Over fifty years of usage have produced many variants and implementations to overcome some of their…
Data compression is a very important feature for saving memory space. In this proposal, indexed dictionary-based compression is used for text data, where each word's reference in the dictionary is used for compression. This approach…
We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. We implement these…
A keyword dictionary is an associative array whose keys are strings. Recent applications handling massive keyword dictionaries in main memory have a need for a space-efficient implementation. When limited to static applications, there are a…
Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into…
String constraint solving refers to solving combinatorial problems involving constraints over string variables. String solving approaches have become popular in recent years given the massive use of strings in different application…
The dictionary matching problem is to locate occurrences of any pattern among a set of patterns in a given text. Massive data sets abound and at the same time, there are many settings in which working space is extremely limited. We…
Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through…
Today, with the growing demands of information storage and data transfer, data compression is becoming increasingly important. Data compression is a technique used to decrease the size of data. This is very useful when some huge…
Compressed indexing is a powerful technique that enables efficient querying over data stored in compressed form, significantly reducing memory usage and often accelerating computation. While extensive progress has been made for…
Given a set of pattern strings $\mathcal{P}=\{P_1, P_2, \ldots, P_k\}$ and a text string $S$, the classic dictionary matching problem is to report all occurrences of each pattern in $S$. We study the dictionary problem in the compressed…
The study of strings is an important combinatorial field that precedes the digital computer. Strings can be very long, trillions of letters, so it is important to find compact representations. Here we first survey various forms of one…
In this paper we present a general method for information extraction that exploits the features of data compression techniques. We first define and focus our attention on the so-called "dictionary" of a sequence. Dictionaries are…
In this paper, a new compression scheme for text is presented. The scheme is efficient, giving high compression ratios, and enables very fast searching within the compressed text. Typical compression ratios of 70-80% and reducing the search…
Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. The efficient storage and processing of such strings usually introduces several challenges that are not witnessed in…
An indexed sequence of strings is a data structure for storing a string sequence that supports random access, searching, range counting and analytics operations, both for exact matches and prefix search. String sequences lie at the core of…