Related papers: Making compression algorithms for Unicode text
Today there are many universal compression algorithms, but in most cases a specialized algorithm works better for a specific kind of data: JPEG for images, MPEG for movies, etc. For textual documents there are special methods based on the PPM algorithm or…
We propose a method to improve traditional character-based PPM text compression algorithms. Considering a text file as a sequence of alternating words and non-words, the basic idea of our algorithm is to encode non-words and prefixes of words…
This paper investigates the employment of various encoders in text transformation, converting characters into bytes. It discusses local encoders such as ASCII and GB-2312, which encode specific characters into shorter bytes, and universal…
Bidirectional compression algorithms work by substituting repeated substrings by references that, unlike in the famous LZ77-scheme, can point to either direction. We present such an algorithm that is particularly suited for an external…
We often represent text using Unicode formats (UTF-8 and UTF-16). The UTF-8 format is increasingly popular, especially on the web (XML, HTML, JSON, Rust, Go, Swift, Ruby). The UTF-16 format is most common in Java, .NET, and inside operating…
In software, text is often represented using Unicode formats (UTF-8 and UTF-16). We frequently have to convert text from one format to the other, a process called transcoding. Popular transcoding functions are slower than state-of-the-art…
Data compression has become a necessity not only in the field of communication but also in various scientific experiments. The volume of data being received has grown, and so has the processing time required. A significant…
In this paper, we revisit the classical data compression problem for domain-specific texts. It is well known that the classical Huffman algorithm is optimal with respect to prefix encoding and that the compression is done at the character level. Since…
We present frequency-ordered tokenization, a simple preprocessing technique that improves lossless text compression by exploiting the power-law frequency distribution of natural language tokens (Zipf's law). The method tokenizes text with…
A text written using symbols from a given alphabet can be compressed using the Huffman code, which minimizes the length of the encoded text. It is necessary, however, to employ a text-specific codebook, i.e. the symbol-codeword dictionary,…
The paper introduces a new lossless, highly robust compression algorithm that is similar to the LZW algorithm, yet discards dictionary processing and instead uses irregular sequences with massive, random information. Then the paper…
The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available…
People tend to keep a lot of files in their storage. When the storage nears its limit, they try to reduce the size of those files to a minimum by using data compression software. In this paper we propose a new algorithm for data…
We propose a special-purpose class of compression algorithms for efficient compression of Prolog programs. It is a dictionary-based compression method, specially designed for the compression of Prolog code, and therefore we name it PCA…
This article shows how the text characters that have multiple representations under the Unicode standard are treated by popular operating systems. Whilst most characters have a unique representation in Unicode, some characters such as the…
Compression algorithms reduce the redundancy in data representation to decrease the storage required for that data. Data compression offers an attractive approach to reducing communication costs by using available bandwidth effectively.…
Huffman coding is known to be optimal, yet its dynamic version may be even more efficient in practice. A new variant of Huffman encoding has been proposed recently that provably always performs better than static Huffman coding by at least…
Inspired by recent work on compression with and for young humans, the success of transform-based approaches to information processing, and the rise of powerful language-based AI, we propose textual transform coding. It shares some of…
It is possible to interpret text as numbers (and vice versa) if one interprets letters and other characters as digits and assumes that they have an inherent immutable ordering. This is demonstrated by the conventional digit set of the…
We consider the problem of restructuring compressed texts without explicit decompression. We present algorithms which allow conversions from compressed representations of a string $T$ produced by any grammar-based compression…