Related papers: Lightweight Conceptual Dictionary Learning for Tex…

LZ-Compressed String Dictionaries

We show how to compress string dictionaries using the Lempel-Ziv (LZ78) data compression algorithm. Our approach is validated experimentally on dictionaries of up to 1.5 GB of uncompressed text. We achieve compression ratios often…

Data Structures and Algorithms · Computer Science 2013-05-06 Julian Arz, Johannes Fischer

Bit-Optimal Lempel-Ziv compression

One of the most famous and investigated lossless data-compression schemes is the one introduced by Lempel and Ziv about 40 years ago. This compression scheme is known as "dictionary-based compression" and consists of squeezing an input…

Data Structures and Algorithms · Computer Science 2008-02-07 Paolo Ferragina, Igor Nitto, Rossano Venturini

Learning to Weight for Text Classification

In information retrieval (IR) and related tasks, term weighting approaches typically consider the frequency of the term in the document and in the collection in order to compute a score reflecting the importance of the term for the…

Machine Learning · Computer Science 2021-09-22 Alejandro Moreo Fernández, Andrea Esuli, Fabrizio Sebastiani

Text Ranking and Classification using Data Compression

A well-known but rarely used approach to text categorization uses conditional entropy estimates computed using data compression tools. Text affinity scores derived from compressed sizes can be used for classification and ranking tasks, but…

Machine Learning · Computer Science 2021-12-08 Nitya Kasturi, Igor L. Markov

Embedding Compression for Text Classification Using Dictionary Screening

In this paper, we propose a dictionary screening method for embedding compression in text classification tasks. The key purpose of this method is to evaluate the importance of each keyword in the dictionary. To this end, we first train a…

Computation and Language · Computer Science 2022-11-24 Jing Zhou, Xinru Jing, Muyu Liu, Hansheng Wang

AlphaZip: Neural Network-Enhanced Lossless Text Compression

Data compression continues to evolve, with traditional information theory methods being widely used for compressing text, images, and videos. Recently, there has been growing interest in leveraging Generative AI for predictive compression…

Information Theory · Computer Science 2024-09-24 Swathi Shree Narashiman, Nitin Chandrachoodan

Lempel-Ziv-like Parsing in Small Space

Lempel-Ziv (LZ77 or, briefly, LZ) is one of the most effective and widely-used compressors for repetitive texts. However, the existing efficient methods computing the exact LZ parsing have to use linear or close to linear space to index the…

Data Structures and Algorithms · Computer Science 2020-05-12 Dmitry Kosolobov, Daniel Valenzuela, Gonzalo Navarro, Simon J. Puglisi

Compression with the tudocomp Framework

We present a framework facilitating the implementation and comparison of text compression algorithms. We evaluate its features by a case study on two novel compression algorithms based on the Lempel-Ziv compression schemes that perform well…

Data Structures and Algorithms · Computer Science 2021-04-23 Patrick Dinklage, Johannes Fischer, Dominik Köppl, Marvin Löbel, Kunihiko Sadakane

Information-theoretic Dictionary Learning for Image Classification

We present a two-stage approach for learning dictionaries for object classification tasks based on the principle of information maximization. The proposed method seeks a dictionary that is compact, discriminative, and generative. In the…

Computer Vision and Pattern Recognition · Computer Science 2015-03-20 Qiang Qiu, Vishal M. Patel, Rama Chellappa

IDBE - An Intelligent Dictionary Based Encoding Algorithm for Text Data Compression for High Speed Data Transmission Over Internet

Compression algorithms reduce the redundancy in data representation to decrease the storage required for that data. Data compression offers an attractive approach to reducing communication costs by using available bandwidth effectively.…

Information Theory · Computer Science 2007-07-13 B. S. Shajee Mohan, V. K. Govindan

Online Embedding Compression for Text Classification using Low Rank Matrix Factorization

Deep learning models have become state of the art for natural language processing (NLP) tasks; however, deploying these models in production systems poses significant memory constraints. Existing compression methods are either lossy or…

Machine Learning · Computer Science 2018-11-05 Anish Acharya, Rahul Goel, Angeliki Metallinou, Inderjit Dhillon

Text Classification with Compression Algorithms

This work concerns a comparison of SVM kernel methods in text categorization tasks. In particular, I define a kernel function that estimates the similarity between two objects by computing their compressed lengths. In fact, compression…

Machine Learning · Computer Science 2012-10-30 Antonio Giuliano Zippo

LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval

Image-text retrieval (ITR) is a task to retrieve the relevant images/texts, given the query from another modality. The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream…

Computer Vision and Pattern Recognition · Computer Science 2023-02-07 Ziyang Luo, Pu Zhao, Can Xu, Xiubo Geng, Tao Shen, Chongyang Tao, Jing Ma, Qingwen Lin, Daxin Jiang

MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression

Large language models have drastically changed the prospects of AI by introducing technologies for more complex natural language processing. However, current methodologies to train such LLMs require extensive resources including but not…

Computation and Language · Computer Science 2026-04-27 Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard

Compression Algorithm Based on Irregular Sequence

The paper introduces a new lossless, highly robust compression algorithm that is similar to the LZW algorithm, yet discards dictionary processing and uses irregular sequences with massive, random information instead. Then the paper…

Signal Processing · Electrical Eng. & Systems 2020-06-24 Rui Zhu

TexShape: Information Theoretic Sentence Embedding for Language Models

With the exponential growth in data volume and the emergence of data-intensive applications, particularly in the field of machine learning, concerns related to resource utilization, privacy, and fairness have become paramount. This paper…

Computation and Language · Computer Science 2024-05-14 Kaan Kale, Homa Esfahanizadeh, Noel Elias, Oguzhan Baser, Muriel Medard, Sriram Vishwanath

An Enhanced Text Compression Approach Using Transformer-based Language Models

Text compression shrinks textual data while keeping crucial information, eradicating constraints on storage, bandwidth, and computational efficacy. The integration of lossless compression techniques with transformer-based text decompression…

Computation and Language · Computer Science 2024-12-23 Chowdhury Mofizur Rahman, Mahbub E Sobhani, Anika Tasnim Rodela, Swakkhar Shatabda

Text Classification based on Word Subspace with Term-Frequency

Text classification has become indispensable due to the rapid increase of text in digital form. Over the past three decades, efforts have been made to approach this task using various learning algorithms and statistical models based on…

Machine Learning · Statistics 2018-06-11 Erica K. Shimomoto, Lincon S. Souza, Bernardo B. Gatto, Kazuhiro Fukui

Label Confidence Weighted Learning for Target-level Sentence Simplification

Multi-level sentence simplification generates simplified sentences with varying language proficiency levels. We propose Label Confidence Weighted Learning (LCWL), a novel approach that incorporates a label confidence weighting scheme in the…

Computation and Language · Computer Science 2024-10-10 Xinying Qiu, Jingshen Zhang

Time and Memory Efficient Lempel-Ziv Compression Using Suffix Arrays

The well-known dictionary-based algorithms of the Lempel-Ziv (LZ) 77 family are the basis of several universal lossless compression techniques. These algorithms are asymmetric regarding encoding/decoding time and memory requirements, with…

Data Structures and Algorithms · Computer Science 2009-12-31 Artur Ferreira, Arlindo Oliveira, Mario Figueiredo