Related papers: Compression with the tudocomp Framework

LZ-Compressed String Dictionaries

We show how to compress string dictionaries using the Lempel-Ziv (LZ78) data compression algorithm. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often…

Data Structures and Algorithms · Computer Science 2013-05-06 Julian Arz , Johannes Fischer

Pushdown Compression

The pressing need for eficient compression schemes for XML documents has recently been focused on stack computation [6, 9], and in particular calls for a formulation of information-lossless stack or pushdown compressors that allows a formal…

Information Theory · Computer Science 2007-09-17 Pilar Albert , Elvira Mayordomo , Philippe Moser , Sylvain Perifel

Polylog space compression, pushdown compression, and Lempel-Ziv are incomparable

The pressing need for efficient compression schemes for XML documents has recently been focused on stack computation, and in particular calls for a formulation of information-lossless stack or pushdown compressors that allows a formal…

Computational Complexity · Computer Science 2009-03-25 Elvira Mayordomo , Philippe Moser , Sylvain Perifel

Sublinear Algorithms for Approximating String Compressibility

We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE)…

Data Structures and Algorithms · Computer Science 2007-06-11 Sofya Raskhodnikova , Dana Ron , Ronitt Rubinfeld , Adam Smith

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

We study the approximate string matching and regular expression matching problem for the case when the text to be searched is compressed with the Ziv-Lempel adaptive dictionary compression schemes. We present a time-space trade-off that…

Data Structures and Algorithms · Computer Science 2007-05-23 Philip Bille , Rolf Fagerberg , Inge Li Goertz

Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression

We propose a novel, lightweight supervised dictionary learning framework for text classification based on data compression and representation. This two-phase algorithm initially employs the Lempel-Ziv-Welch (LZW) algorithm to construct a…

Computation and Language · Computer Science 2024-05-06 Li Wan , Tansu Alpcan , Margreta Kuijper , Emanuele Viterbo

A Self-contained Analysis of the Lempel-Ziv Compression Algorithm

This article gives a self-contained analysis of the performance of the Lempel-Ziv compression algorithm on (hidden) Markovian sources. Specifically we include a full proof of the assertion that the compression rate approaches the entropy…

Information Theory · Computer Science 2019-10-03 Madhu Sudan , David Xiang

Approximating Optimal Bidirectional Macro Schemes

Lempel-Ziv is an easy-to-compute member of a wide family of so-called macro schemes; it restricts pointers to go in one direction only. Optimal bidirectional macro schemes are NP-complete to find, but they may provide much better…

Data Structures and Algorithms · Computer Science 2020-03-06 Luís M. S. Russo , Ana D. Correia , Gonzalo Navarro , Alexandre P. Francisco

LZD-style Compression Scheme with Truncation and Repetitions

Lempel-Ziv-Double (LZD) is a variation of the LZ78 compression scheme that achieves better compression on repetitive datasets. Nevertheless, prior research has identified computational inefficiencies and a weakness in its compressibility…

Data Structures and Algorithms · Computer Science 2025-05-05 Linus Götz , Dominik Köppl

A High-Throughput Hardware Accelerator for Lempel-Ziv 4 Compression Algorithm

This paper delves into recent hardware implementations of the Lempel-Ziv 4 (LZ4) algorithm, highlighting two key factors that limit the throughput of single-kernel compressors. Firstly, the actual parallelism exhibited in single-kernel…

Hardware Architecture · Computer Science 2024-09-20 Tao Chen , Suwen Song , Zhongfeng Wang

Lossy Compression in Near-Linear Time via Efficient Random Codebooks and Databases

The compression-complexity trade-off of lossy compression algorithms that are based on a random codebook or a random database is examined. Motivated, in part, by recent results of Gupta-Verd\'{u}-Weissman (GVW) and their underlying…

Information Theory · Computer Science 2009-04-23 Chris Gioran , Ioannis Kontoyiannis

Universal Indexes for Highly Repetitive Document Collections

Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes, but are formed mostly of documents that…

Information Retrieval · Computer Science 2016-05-25 Francisco Claude , Antonio Fariña , Miguel A. Martínez-Prieto , Gonzalo Navarro

Relative Lempel-Ziv Factorization for Efficient Storage and Retrieval of Web Collections

Compression techniques that support fast random access are a core component of any information system. Current state-of-the-art methods group documents into fixed-sized blocks and compress each block with a general-purpose adaptive…

Data Structures and Algorithms · Computer Science 2015-03-19 Christopher Hoobin , Simon J. Puglisi , Justin Zobel

A study for Image compression using Re-Pair algorithm

The compression is an important topic in computer science which allows we to storage more amount of data on our data storage. There are several techniques to compress any file. In this manuscript will be described the most important…

Multimedia · Computer Science 2019-02-14 Pasquale De Luca , Vincenzo Maria Russiello , Raffaele Ciro Sannino , Lorenzo Valente

zip2zip: Inference-Time Adaptive Tokenization via Online Compression

Tokenization efficiency plays a critical role in the performance and cost of large language models (LLMs), yet most models rely on static tokenizers optimized on general-purpose corpora. These tokenizers' fixed vocabularies often fail to…

Computation and Language · Computer Science 2025-10-27 Saibo Geng , Nathan Ranchin , Yunzhen yao , Maxime Peyrard , Chris Wendler , Michael Gastpar , Robert West

AlphaZip: Neural Network-Enhanced Lossless Text Compression

Data compression continues to evolve, with traditional information theory methods being widely used for compressing text, images, and videos. Recently, there has been growing interest in leveraging Generative AI for predictive compression…

Information Theory · Computer Science 2024-09-24 Swathi Shree Narashiman , Nitin Chandrachoodan

IDBE - An Intelligent Dictionary Based Encoding Algorithm for Text Data Compression for High Speed Data Transmission Over Internet

Compression algorithms reduce the redundancy in data representation to decrease the storage required for that data. Data compression offers an attractive approach to reducing communication costs by using available bandwidth effectively.…

Information Theory · Computer Science 2007-07-13 B. S. Shajee Mohan , V. K. Govindan

Time and Memory Efficient Lempel-Ziv Compression Using Suffix Arrays

The well-known dictionary-based algorithms of the Lempel-Ziv (LZ) 77 family are the basis of several universal lossless compression techniques. These algorithms are asymmetric regarding encoding/decoding time and memory requirements, with…

Data Structures and Algorithms · Computer Science 2009-12-31 Artur Ferreira , Arlindo Oliveira , Mario Figueiredo

MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression

Large language models have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies to train such LLMs require extensive resources including but not…

Computation and Language · Computer Science 2026-04-27 Noel Elias , Homa Esfahanizadeh , Kaan Kale , Sriram Vishwanath , Muriel Medard

Substring Compression Variations and LZ78-Derivates

We propose algorithms computing the semi-greedy Lempel-Ziv 78 (LZ78), the Lempel-Ziv Double (LZD), and the Lempel-Ziv-Miller-Wegman (LZMW) factorizations in linear time for integer alphabets. For LZD and LZMW, we additionally propose data…

Data Structures and Algorithms · Computer Science 2024-09-24 Dominik Köppl