Related papers: Restructuring Compressed Texts without Explicit De…

Grammar Boosting: A New Technique for Proving Lower Bounds for Computation over Compressed Data

Grammar compression is a general compression framework in which a string $T$ of length $N$ is represented as a context-free grammar of size $n$ whose language contains only $T$. In this paper, we focus on studying the limitations of…

Data Structures and Algorithms · Computer Science 2024-09-24 Rajat De , Dominik Kempa

Random Access to Grammar Compressed Strings

Grammar based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures many popular compression schemes. In this paper, we present a novel…

Data Structures and Algorithms · Computer Science 2013-10-30 Philip Bille , Gad M. Landau , Rajeev Raman , Kunihiko Sadakane , Srinivasa Rao Satti , Oren Weimann

Bidirectional Text Compression in External Memory

Bidirectional compression algorithms work by substituting repeated substrings by references that, unlike in the famous LZ77-scheme, can point to either direction. We present such an algorithm that is particularly suited for an external…

Data Structures and Algorithms · Computer Science 2019-12-04 Patrick Dinklage , Jonas Ellert , Johannes Fischer , Dominik Köppl , Manuel Penschuck

Learning Directly from Grammar Compressed Text

Neural networks using numerous text data have been successfully applied to a variety of tasks. While massive text data is usually compressed using techniques such as grammar compression, almost all of the previous machine learning methods…

Machine Learning · Statistics 2020-03-02 Yoichi Sasaki , Kosuke Akimoto , Takanori Maehara

On optimally partitioning a text to improve its compression

In this paper we investigate the problem of partitioning an input string T in such a way that compressing individually its parts via a base-compressor C gets a compressed output that is shorter than applying C over the entire T at once.…

Data Structures and Algorithms · Computer Science 2009-06-26 Paolo Ferragina , Igor Nitto , Rossano Venturini

Practical Random Access to SLP-Compressed Texts

Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as…

Data Structures and Algorithms · Computer Science 2020-07-21 Travis Gagie , Tomohiro I , Giovanni Manzini , Gonzalo Navarro , Hiroshi Sakamoto , Louisa Seelbach Benkner , Yoshimasa Takabatake

Machine Translation with Unsupervised Length-Constraints

We have seen significant improvements in machine translation due to the usage of deep learning. While the improvements in translation quality are impressive, the encoder-decoder architecture enables many more possibilities. In this paper,…

Computation and Language · Computer Science 2020-04-08 Jan Niehues

Approximation of grammar-based compression via recompression

In this paper we present a simple linear-time algorithm constructing a context-free grammar of size O(g log(N/g)) for the input string, where N is the size of the input string and g the size of the optimal grammar generating this string.…

Data Structures and Algorithms · Computer Science 2013-11-08 Artur Jeż

Compressibility-Aware Quantum Algorithms on Strings

Sublinear time quantum algorithms have been established for many fundamental problems on strings. This work demonstrates that new, faster quantum algorithms can be designed when the string is highly compressible. We focus on two popular and…

Data Structures and Algorithms · Computer Science 2023-02-15 Daniel Gibney , Sharma V. Thankachan

Semantic Compression With Large Language Models

The rise of large language models (LLMs) is revolutionizing information retrieval, question answering, summarization, and code generation tasks. However, in addition to confidently presenting factually inaccurate information at times (known…

Artificial Intelligence · Computer Science 2023-04-26 Henry Gilbert , Michael Sandborn , Douglas C. Schmidt , Jesse Spencer-Smith , Jules White

Tree structure compression with RePair

In this work we introduce a new linear time compression algorithm, called "Re-pair for Trees", which compresses ranked ordered trees using linear straight-line context-free tree grammars. Such grammars generalize straight-line context-free…

Data Structures and Algorithms · Computer Science 2010-08-02 Markus Lohrey , Sebastian Maneth , Roy Mennicke

An Enhanced Text Compression Approach Using Transformer-based Language Models

Text compression shrinks textual data while keeping crucial information, eradicating constraints on storage, bandwidth, and computational efficacy. The integration of lossless compression techniques with transformer-based text decompression…

Computation and Language · Computer Science 2024-12-23 Chowdhury Mofizur Rahman , Mahbub E Sobhani , Anika Tasnim Rodela , Swakkhar Shatabda

Extracting Text Representations for Terms and Phrases in Technical Domains

Extracting dense representations for terms and phrases is a task of great importance for knowledge discovery platforms targeting highly-technical fields. Dense representations are used as features for downstream components and have multiple…

Computation and Language · Computer Science 2023-05-26 Francesco Fusco , Diego Antognini

AlphaZip: Neural Network-Enhanced Lossless Text Compression

Data compression continues to evolve, with traditional information theory methods being widely used for compressing text, images, and videos. Recently, there has been growing interest in leveraging Generative AI for predictive compression…

Information Theory · Computer Science 2024-09-24 Swathi Shree Narashiman , Nitin Chandrachoodan

Semantic Text Compression for Classification

We study semantic compression for text where meanings contained in the text are conveyed to a source decoder, e.g., for classification. The main motivator to move to such an approach of recovering the meaning without requiring exact…

Information Theory · Computer Science 2023-09-20 Emrecan Kutay , Aylin Yener

Word-Based Text Compression

Today there are many universal compression algorithms, but in most cases it is better to use a specific algorithm for specific data: JPEG for images, MPEG for movies, etc. For textual documents there are special methods based on the PPM algorithm or…

Information Theory · Computer Science 2008-12-18 Jan Platos , Jiri Dvorsky

Solving Classical String Problems on Compressed Texts

Here we study the complexity of string problems as a function of the size of a program that generates input. We consider straight-line programs (SLP), since all algorithms on SLP-generated strings could be applied to processing…

Data Structures and Algorithms · Computer Science 2007-05-23 Yury Lifshits

Pattern Matching on Grammar-Compressed Strings in Linear Time

The most fundamental problem considered in algorithms for text processing is pattern matching: given a pattern $p$ of length $m$ and a text $t$ of length $n$, does $p$ occur in $t$? Multiple versions of this basic question have been…

Data Structures and Algorithms · Computer Science 2021-11-10 Moses Ganardi , Paweł Gawrychowski

Engineering Fast and Space-Efficient Recompression from SLP-Compressed Text

Compressed indexing enables powerful queries over massive and repetitive textual datasets using space proportional to the compressed input. While theoretical advances have led to highly efficient index structures, their practical…

Data Structures and Algorithms · Computer Science 2025-10-24 Ankith Reddy Adudodla , Dominik Kempa

Improving PPM Algorithm Using Dictionaries

We propose a method to improve traditional character-based PPM text compression algorithms. A text file is treated as a sequence of alternating words and non-words; the basic idea of our algorithm is to encode non-words and prefixes of words…

Information Theory · Computer Science 2015-03-17 Yichuan Hu , Jianzhong Zhang , Farooq Khan , Ying Li