Related papers: Improving PPM Algorithm Using Dictionaries
The majority of online content is written in languages other than English, and is most commonly encoded in UTF-8, the world's dominant Unicode character encoding. Traditional compression algorithms typically operate on individual bytes.…
Today there are many universal compression algorithms, but in most cases is for specific data better using specific algorithm - JPEG for images, MPEG for movies, etc. For textual documents there are special methods based on PPM algorithm or…
Data compression is very important feature in terms of saving the memory space. In this proposal, an indexed dictionary based compression is used for text data, where the word's reference in dictionary is used for compression. This approach…
In this paper, we propose a dictionary screening method for embedding compression in text classification tasks. The key purpose of this method is to evaluate the importance of each keyword in the dictionary. To this end, we first train a…
Since its introduction prediction by partial matching (PPM) has always been a de facto gold standard in lossless text compression, where many variants improving the compression ratio and speed have been proposed. However, reducing the high…
How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is rapidly growing to benefit from remarkable advances of recent language models without side effects due to the…
Most end-to-end speech recognition systems model text directly as a sequence of characters or sub-words. Current approaches to sub-word extraction only consider character sequence frequencies, which at times produce inferior sub-word…
Probabilistic context-free grammars (PCFGs) with neural parameterization have been shown to be effective in unsupervised phrase-structure grammar induction. However, due to the cubic computational complexity of PCFG representation and…
Most existing Neural Machine Translation models use groups of characters or whole words as their unit of input and output. We propose a model with a hierarchical char2word encoder, that takes individual characters both as input and output.…
Many services today massively and continuously produce log files of different and varying formats. These logs are important since they contain information about the application activities, which is necessary for improvements by analyzing…
Grammar-based compression is a popular and powerful approach to compressing repetitive texts but until recently its relatively poor time-space trade-offs during real-life construction made it impractical for truly massive datasets such as…
Word-embeddings are vital components of Natural Language Processing (NLP) models and have been extensively explored. However, they consume a lot of memory which poses a challenge for edge deployment. Embedding matrices, typically, contain…
Neural networks using numerous text data have been successfully applied to a variety of tasks. While massive text data is usually compressed using techniques such as grammar compression, almost all of the previous machine learning methods…
In-context learning has established itself as an important learning paradigm for Large Language Models (LLMs). In this paper, we demonstrate that LLMs can learn encoding keys in-context and perform analysis directly on encoded…
Neural networks are among the state-of-the-art techniques for language modeling. Existing neural language models typically map discrete words to distributed, dense vector representations. After information processing of the preceding…
Subword tokenization requires balancing computational efficiency and vocabulary coverage, which often leads to suboptimal performance on languages and scripts not prioritized during training. We propose to augment pretrained language models…
There have been some works that learn a lexicon together with the corpus to improve the word embeddings. However, they either model the lexicon separately but update the neural networks for both the corpus and the lexicon by the same…
Prefix parsing asks whether an input prefix can be extended to a complete string generated by a given grammar. In the weighted setting, it also provides prefix probabilities, which are central to context-free language modeling,…
Efficient methods for storing and querying are critical for scaling high-order n-gram language models to large corpora. We propose a language model based on compressed suffix trees, a representation that is highly compact and can be easily…
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this…