Related papers: Making compression algorithms for Unicode text

Word-Based Text Compression

Today there are many universal compression algorithms, but in most cases a specialized algorithm works better for a specific kind of data: JPEG for images, MPEG for movies, etc. For textual documents there are special methods based on the PPM algorithm or…

Information Theory · Computer Science 2008-12-18 Jan Platos, Jiri Dvorsky

Improving PPM Algorithm Using Dictionaries

We propose a method to improve traditional character-based PPM text compression algorithms. Treating a text file as a sequence of alternating words and non-words, the basic idea of our algorithm is to encode non-words and prefixes of words…

Information Theory · Computer Science 2015-03-17 Yichuan Hu, Jianzhong Zhang, Farooq Khan, Ying Li

Duncode Characters Shorter

This paper investigates the employment of various encoders in text transformation, converting characters into bytes. It discusses local encoders such as ASCII and GB-2312, which encode specific characters into shorter bytes, and universal…

Computation and Language · Computer Science 2023-07-12 Changshang Xue

Bidirectional Text Compression in External Memory

Bidirectional compression algorithms work by substituting repeated substrings by references that, unlike in the famous LZ77-scheme, can point to either direction. We present such an algorithm that is particularly suited for an external…

Data Structures and Algorithms · Computer Science 2019-12-04 Patrick Dinklage, Jonas Ellert, Johannes Fischer, Dominik Köppl, Manuel Penschuck

Unicode at Gigabytes per Second

We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating…

Programming Languages · Computer Science 2023-05-23 Daniel Lemire

Transcoding Billions of Unicode Characters per Second with SIMD Instructions

In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-08-16 Daniel Lemire, Wojciech Muła

An Optimized Huffman's Coding by the method of Grouping

Data compression has become a necessity not only in the field of communication but also in various scientific experiments. The volume of data being received has grown, and so has the processing time required. A significant…

Information Theory · Computer Science 2016-07-29 Gautam R, S Murali

Domain Specific Hierarchical Huffman Encoding

In this paper, we revisit the classical data compression problem for domain-specific texts. It is well known that the classical Huffman algorithm is optimal with respect to prefix encoding and that the compression is done at the character level. Since…

Information Theory · Computer Science 2013-07-04 K. Ilambharathi, G. S. N. V. Venkata Manik, N. Sadagopan, B. Sivaselvan

Frequency-Ordered Tokenization for Better Text Compression

We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with…

Information Theory · Computer Science 2026-02-27 Maximilian Kalcher

Optimal alphabet for single text compression

A text written using symbols from a given alphabet can be compressed using the Huffman code, which minimizes the length of the encoded text. It is necessary, however, to employ a text-specific codebook, i.e. the symbol-codeword dictionary,…

Information Theory · Computer Science 2022-08-02 Armen E. Allahverdyan, Andranik Khachatryan

Compression Algorithm Based on Irregular Sequence

The paper introduces a new lossless, highly robust compression algorithm that is similar to the LZW algorithm, yet it discards dictionary processing and instead uses irregular sequences with massive, random information. Then the paper…

Signal Processing · Electrical Eng. & Systems 2020-06-24 Rui Zhu

Validating UTF-8 In Less Than One Instruction Per Byte

The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available…

Databases · Computer Science 2026-04-22 John Keiser, Daniel Lemire

A New Algorithm for Data Compression Optimization

People tend to store a lot of files on their storage devices. When the storage nears its limit, they try to reduce the size of those files to a minimum by using data compression software. In this paper we propose a new algorithm for data…

Data Structures and Algorithms · Computer Science 2012-09-06 I. Made Agus Dwi Suarjaya

Efficient Compression of Prolog Programs

We propose a special-purpose class of compression algorithms for efficient compression of Prolog programs. It is a dictionary-based compression method, specially designed for the compression of Prolog code, and therefore we name it PCA…

Programming Languages · Computer Science 2007-05-23 Alin Suciu, Kalman Pusztai

Treatment of Unicode canoncal decomposition among operating systems

This article shows how the text characters that have multiple representations under the Unicode standard are treated by popular operating systems. Whilst most characters have a unique representation in Unicode, some characters such as the…

Other Computer Science · Computer Science 2017-11-30 Efstratios Rappos

IDBE - An Intelligent Dictionary Based Encoding Algorithm for Text Data Compression for High Speed Data Transmission Over Internet

Compression algorithms reduce the redundancy in data representation to decrease the storage required for that data. Data compression offers an attractive approach to reducing communication costs by using available bandwidth effectively.…

Information Theory · Computer Science 2007-07-13 B. S. Shajee Mohan, V. K. Govindan

Weighted Adaptive Coding

Huffman coding is known to be optimal, yet its dynamic version may be even more efficient in practice. A new variant of Huffman encoding has been proposed recently, that provably always performs better than static Huffman coding by at least…

Data Structures and Algorithms · Computer Science 2020-05-19 Aharon Fruchtman, Yoav Gross, Shmuel T. Klein, Dana Shapira

Toward Textual Transform Coding

Inspired by recent work on compression with and for young humans, the success of transform-based approaches to information processing, and the rise of powerful language-based AI, we propose textual transform coding. It shares some of…

Information Theory · Computer Science 2023-05-04 Tsachy Weissman

Encryption by using base-n systems with many characters

It is possible to interpret text as numbers (and vice versa) if one interprets letters and other characters as digits and assumes that they have an inherent immutable ordering. This is demonstrated by the conventional digit set of the…

Cryptography and Security · Computer Science 2023-06-06 Armin Hoenen

Restructuring Compressed Texts without Explicit Decompression

We consider the problem of restructuring compressed texts without explicit decompression. We present algorithms which allow conversions from compressed representations of a string T produced by any grammar-based compression…

Data Structures and Algorithms · Computer Science 2011-07-15 Keisuke Goto, Shirou Maruyama, Shunsuke Inenaga, Hideo Bannai, Hiroshi Sakamoto, Masayuki Takeda