Related papers: Improving PPM Algorithm Using Dictionaries

Making compression algorithms for Unicode text

The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world's dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes.…

Information Theory · Computer Science 2017-01-17 Adam Gleave , Christian Steinruecken

Word-Based Text Compression

Today there are many universal compression algorithms, but in most cases is for specific data better using specific algorithm - JPEG for images, MPEG for movies, etc. For textual documents there are special methods based on PPM algorithm or…

Information Theory · Computer Science 2008-12-18 Jan Platos , Jiri Dvorsky

A Novel Approach to Compress Centralized Text Data using Indexed Dictionary

Data compression is very important feature in terms of saving the memory space. In this proposal, an indexed dictionary based compression is used for text data, where the word's reference in dictionary is used for compression. This approach…

Other Computer Science · Computer Science 2015-12-23 Vivek Dimri , Prof. Ranjit Biswas

Embedding Compression for Text Classification Using Dictionary Screening

In this paper, we propose a dictionary screening method for embedding compression in text classification tasks. The key purpose of this method is to evaluate the importance of each keyword in the dictionary. To this end, we first train a…

Computation and Language · Computer Science 2022-11-24 Jing Zhou , Xinru Jing , Muyu Liu , Hansheng Wang

A memory versus compression ratio trade-off in PPM via compressed context modeling

Since its introduction prediction by partial matching (PPM) has always been a de facto gold standard in lossless text compression, where many variants improving the compression ratio and speed have been proposed. However, reducing the high…

Data Structures and Algorithms · Computer Science 2012-11-13 M. Oguzhan Kulekci

A Comprehensive Survey of Compression Algorithms for Language Models

How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is rapidly growing to benefit from remarkable advances of recent language models without side effects due to the…

Computation and Language · Computer Science 2024-01-30 Seungcheol Park , Jaehyeon Choi , Sojin Lee , U Kang

Improving End-to-end Speech Recognition with Pronunciation-assisted Sub-word Modeling

Most end-to-end speech recognition systems model text directly as a sequence of characters or sub-words. Current approaches to sub-word extraction only consider character sequence frequencies, which at times produce inferior sub-word…

Computation and Language · Computer Science 2019-02-22 Hainan Xu , Shuoyang Ding , Shinji Watanabe

PCFGs Can Do Better: Inducing Probabilistic Context-Free Grammars with Many Symbols

Probabilistic context-free grammars (PCFGs) with neural parameterization have been shown to be effective in unsupervised phrase-structure grammar induction. However, due to the cubic computational complexity of PCFG representation and…

Computation and Language · Computer Science 2021-04-29 Songlin Yang , Yanpeng Zhao , Kewei Tu

Neural Machine Translation with Characters and Hierarchical Encoding

Most existing Neural Machine Translation models use groups of characters or whole words as their unit of input and output. We propose a model with a hierarchical char2word encoder, that takes individual characters both as input and output.…

Computation and Language · Computer Science 2016-10-21 Alexander Rosenberg Johansen , Jonas Meinertz Hansen , Elias Khazen Obeid , Casper Kaae Sønderby , Ole Winther

Extension of Dictionary-Based Compression Algorithms for the Quantitative Visualization of Patterns from Log Files

Many services today massively and continuously produce log files of different and varying formats. These logs are important since they contain information about the application activities, which is necessary for improvements by analyzing…

Information Retrieval · Computer Science 2023-04-11 Igor Cherepanov , Jonathan Geraldi Joewono , Arjan Kuijper , Jörn Kohlhammer

Practical Random Access to SLP-Compressed Texts

Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as…

Data Structures and Algorithms · Computer Science 2020-07-21 Travis Gagie , Tomohiro I , Giovanni Manzini , Gonzalo Navarro , Hiroshi Sakamoto , Louisa Seelbach Benkner , Yoshimasa Takabatake

Improving Word Embedding Factorization for Compression Using Distilled Nonlinear Neural Decomposition

Word-embeddings are vital components of Natural Language Processing (NLP) models and have been extensively explored. However, they consume a lot of memory which poses a challenge for edge deployment. Embedding matrices, typically, contain…

Computation and Language · Computer Science 2020-11-12 Vasileios Lioutas , Ahmad Rashid , Krtin Kumar , Md Akmal Haidar , Mehdi Rezagholizadeh

Learning Directly from Grammar Compressed Text

Neural networks using numerous text data have been successfully applied to a variety of tasks. While massive text data is usually compressed using techniques such as grammar compression, almost all of the previous machine learning methods…

Machine Learning · Statistics 2020-03-02 Yoichi Sasaki , Kosuke Akimoto , Takanori Maehara

Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data

In-context learning has established itself as an important learning paradigm for Large Language Models (LLMs). In this paper, we demonstrate that LLMs can learn encoding keys in-context and perform analysis directly on encoded…

Computation and Language · Computer Science 2026-04-16 Andresa Rodrigues de Campos , David Lee , Imry Kissos , Piyush Paritosh

Compressing Neural Language Models by Sparse Word Representations

Neural networks are among the state-of-the-art techniques for language modeling. Existing neural language models typically map discrete words to distributed, dense vector representations. After information processing of the preceding…

Computation and Language · Computer Science 2016-10-14 Yunchuan Chen , Lili Mou , Yan Xu , Ge Li , Zhi Jin

Overcoming Vocabulary Constraints with Pixel-level Fallback

Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models…

Computation and Language · Computer Science 2025-08-12 Jonas F. Lotz , Hendra Setiawan , Stephan Peitz , Yova Kementchedjhieva

Improve Lexicon-based Word Embeddings By Word Sense Disambiguation

There have been some works that learn a lexicon together with the corpus to improve the word embeddings. However, they either model the lexicon separately but update the neural networks for both the corpus and the lexicon by the same…

Computation and Language · Computer Science 2017-07-25 Yuanzhi Ke , Masafumi Hagiwara

Prefix Parsing is Just Parsing

Prefix parsing asks whether an input prefix can be extended to a complete string generated by a given grammar. In the weighted setting, it also provides prefix probabilities, which are central to context-free language modeling,…

Computation and Language · Computer Science 2026-05-05 Clemente Pasti , Andreas Opedal , Timothy J. O'Donnell , Ryan Cotterell , Tim Vieira

Fast, Small and Exact: Infinite-order Language Modelling with Compressed Suffix Trees

Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily…

Computation and Language · Computer Science 2016-08-17 Ehsan Shareghi , Matthias Petri , Gholamreza Haffari , Trevor Cohn

Neural Machine Translation of Rare Words with Subword Units

Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this…

Computation and Language · Computer Science 2016-06-13 Rico Sennrich , Barry Haddow , Alexandra Birch