Related papers: Practical Entropy-Compressed Rank/Select Dictionar…
Rank and select queries on bitmaps are essential building bricks of many compressed data structures, including text indexes, membership and range supporting spatial data structures, compressed graphs, and more. Theoretically considered yet…
The choice dictionary is introduced as a data structure that can be initialized with a parameter $n\in\mathbb{N}=\{1,2,\ldots\}$ and subsequently maintains an initially empty subset $S$ of $\{1,\ldots,n\}$ under insertion, deletion,…
Bit vectors are fundamental building blocks of many succinct data structures. They can be used to represent graphs, are an important part of many text indices in the form of the wavelet tree, and can be used to encode ordered sequences of…
Sequence representations supporting not only direct access to their symbols, but also rank/select operations, are a fundamental building block in many compressed data structures. Several recent applications need to represent highly…
Given a string $S$ of length $N$ on a fixed alphabet of $\sigma$ symbols, a grammar compressor produces a context-free grammar $G$ of size $n$ that generates $S$ and only $S$. In this paper we describe data structures to support the…
Large-alphabet strings are common in scenarios such as information retrieval and natural-language processing. The efficient storage and processing of such strings usually introduces several challenges that are not witnessed in…
Succinct data structures give space-efficient representations of large amounts of data without sacrificing performance. They rely one cleverly designed data representations and algorithms. We present here the formalization in Coq/SSReflect…
Bit vectors with support for fast rank and select are a fundamental building block for compressed data structures. We close a gap between theory and practice by analyzing an important part of the design space and experimentally evaluating a…
A choice dictionary is a data structure that can be initialized with a parameter $n\in\{1,2,\ldots\}$ and subsequently maintains an initially empty subset $S$ of $\{1,\ldots,n\}$ under insertion, deletion, membership queries and an…
Data partitioning that maximizes/minimizes the Shannon entropy, or more generally the R\'enyi entropy is a crucial subroutine in data compression, columnar storage, and cardinality estimation algorithms. These partition algorithms can be…
A well-known but rarely used approach to text categorization uses conditional entropy estimates computed using data compression tools. Text affinity scores derived from compressed sizes can be used for classification and ranking tasks, but…
The determination of block-entropies is a well established method for the investigation of discrete data, also called symbols (7). There is a large variety of such symbolic sequences, ranging from texts written in natural languages,…
Sequence representations supporting queries $access$, $select$ and $rank$ are at the core of many data structures. There is a considerable gap between the various upper bounds and the few lower bounds known for such representations, and how…
The definition of $k^{th}$-order empirical entropy of strings is extended to node labelled binary trees. A suitable binary encoding of tree straight-line programs (that have been used for grammar-based tree compression before) is shown to…
We describe a data structure that stores a string $S$ in space similar to that of its Lempel-Ziv encoding and efficiently supports access, rank and select queries. These queries are fundamental for implementing succinct and compressed data…
Grammar compression represents a string as a context free grammar. Achieving compression requires encoding such grammar as a binary string; there are a few commonly used encodings. We bound the size of practically used encodings for several…
It has been shown in the indexing literature that there is an essential difference between prefix/range searches on the one hand, and predecessor/rank searches on the other hand, in that the former provably allows faster query resolution.…
Practical data structures for the edit-sensitive parsing (ESP) are proposed. Given a string S, its ESP tree is equivalent to a context-free grammar G generating just S, which is represented by a DAG. Using the succinct data structures for…
Probabilistic embeddings have several advantages over deterministic embeddings as they map each data point to a distribution, which better describes the uncertainty and complexity of data. Many works focus on adjusting the distribution…
Sample selection improves the efficiency and effectiveness of machine learning models by providing informative and representative samples. Typically, samples can be modeled as a sample graph, where nodes are samples and edges represent…