Related papers: Proxy Compression for Language Modeling

ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models

Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and…

Computation and Language · Computer Science 2025-08-11 Kaizhi Qian, Xulin Fan, Junrui Ni, Slava Shechtman, Mark Hasegawa-Johnson, Chuang Gan, Yang Zhang

Decoupling the Benefits of Subword Tokenization for Language Model Training via Byte-level Simulation

Subword tokenization is an essential part of modern large language models (LLMs), yet its specific contributions to training efficiency and model performance remain poorly understood. In this work, we decouple the effects of subword…

Computation and Language · Computer Science 2026-05-15 Théo Gigant, Bowen Peng, Jeffrey Quesnelle

Learn Your Tokens: Word-Pooled Tokenization for Language Modeling

Language models typically tokenize text into subwords, using a deterministic, hand-engineered heuristic of combining characters into longer surface-level strings such as 'ing' or whole words. Recent literature has repeatedly shown the…

Computation and Language · Computer Science 2023-10-19 Avijit Thawani, Saurabh Ghanekar, Xiaoyuan Zhu, Jay Pujara

Model Compression and Efficient Inference for Large Language Models: A Survey

Transformer based large language models have achieved tremendous success. However, the significant memory and computational costs incurred during the inference process make it challenging to deploy large models on resource-constrained…

Computation and Language · Computer Science 2024-02-16 Wenxiao Wang, Wei Chen, Yicong Luo, Yongliu Long, Zhengkai Lin, Liye Zhang, Binbin Lin, Deng Cai, Xiaofei He

A Comprehensive Survey of Compression Algorithms for Language Models

How can we compress language models without sacrificing accuracy? The number of compression algorithms for language models is rapidly growing to benefit from remarkable advances of recent language models without side effects due to the…

Computation and Language · Computer Science 2024-01-30 Seungcheol Park, Jaehyeon Choi, Sojin Lee, U Kang

ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in…

Computation and Language · Computer Science 2026-03-05 Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li

Projected Compression: Trainable Projection for Efficient Transformer Compression

Large language models have steadily increased in size to achieve improved performance; however, this growth has also led to greater inference time and computational demands. Consequently, there is rising interest in model size reduction…

Machine Learning · Computer Science 2025-06-30 Maciej Stefaniak, Michał Krutul, Jan Małaśnicki, Maciej Pióro, Jakub Krajewski, Sebastian Jaszczur, Marek Cygan, Kamil Adamczewski, Jan Ludziejewski

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs…

Computation and Language · Computer Science 2023-12-07 Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. By comparison, token-free models that operate directly on raw text (bytes or characters) have many benefits: they can…

Computation and Language · Computer Science 2022-03-09 Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

Compression of Recurrent Neural Networks for Efficient Language Modeling

Recurrent neural networks have proved to be an effective method for statistical language modeling. However, in practice their memory and run-time complexity are usually too large to be implemented in real-time offline mobile applications.…

Computation and Language · Computer Science 2019-04-09 Artem M. Grachev, Dmitry I. Ignatov, Andrey V. Savchenko

From Language Models over Tokens to Language Models over Characters

Modern language models are internally, and mathematically, distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example,…

Computation and Language · Computer Science 2025-06-11 Tim Vieira, Ben LeBrun, Mario Giulianelli, Juan Luis Gastaldi, Brian DuSell, John Terilla, Timothy J. O'Donnell, Ryan Cotterell

Perception Compressor: A Training-Free Prompt Compression Framework in Long Context Scenarios

Large language models (LLMs) demonstrate exceptional capabilities in various scenarios. However, they suffer from much redundant information and are sensitive to the position of key information in long context scenarios. To address these…

Computation and Language · Computer Science 2025-02-11 Jiwei Tang, Jin Xu, Tingwei Lu, Zhicheng Zhang, Yiming Zhao, Lin Hai, Hai-Tao Zheng

Model Compression vs. Adversarial Robustness: An Empirical Study on Language Models for Code

Transformer-based language models for code have shown remarkable performance in various software analytics tasks, but their adoption is hindered by high computational costs, slow inference speeds, and substantial environmental impact. Model…

Software Engineering · Computer Science 2026-04-15 Md. Abdul Awal, Mrigank Rochan, Chanchal K. Roy

Speech Token Prediction via Compressed-to-fine Language Modeling for Speech Generation

Neural audio codecs, used as speech tokenizers, have demonstrated remarkable potential in the field of speech generation. However, to ensure high-fidelity audio reconstruction, neural audio codecs typically encode audio into long sequences…

Audio and Speech Processing · Electrical Eng. & Systems 2025-06-02 Wenrui Liu, Qian Chen, Wen Wang, Yafeng Chen, Jin Xu, Zhifang Guo, Guanrou Yang, Weiqin Li, Xiaoda Yang, Tao Jin, Minghui Fang, Jialong Zuo, Bai Jionghao, Zemin Liu

Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles

Prompt compression condenses contexts while maintaining their informativeness for different usage scenarios. It not only shortens the inference time and reduces computational costs during the usage of large language models, but also lowers…

Computation and Language · Computer Science 2024-10-21 Xiao Pu, Tianxing He, Xiaojun Wan

A Survey on Transformer Compression

The Transformer plays a vital role in the realms of natural language processing (NLP) and computer vision (CV), especially for constructing large language models (LLMs) and large vision models (LVMs). Model compression methods reduce the memory…

Machine Learning · Computer Science 2024-04-09 Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, Dacheng Tao

Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference

With the wide adoption of language models for IR, and specifically RAG systems, the latency of the underlying LLM becomes a crucial bottleneck, since the long contexts of retrieved passages lead to large prompts and therefore, compute…

Information Retrieval · Computer Science 2026-04-06 Cornelius Kummer, Lena Jurkschat, Michael Färber, Sahar Vahdati

Compressed-Sensing-Guided, Inference-Aware Structured Reduction for Large Language Models

Large language models deliver strong generative performance but at the cost of massive parameter counts, memory use, and decoding latency. Prior work has shown that pruning and structured sparsity can preserve accuracy under substantial…

Computation and Language · Computer Science 2026-04-17 Andrew Kiruluta

An Empirical Study on Prompt Compression for Large Language Models

Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression…

Computation and Language · Computer Science 2025-05-02 Zheng Zhang, Jinyi Li, Yihuai Lan, Xiang Wang, Hao Wang

FLEXITOKENS: Flexible Tokenization for Evolving Language Models

Adapting language models to new data distributions by simple finetuning is challenging. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to…

Computation and Language · Computer Science 2026-05-14 Abraham Toluwase Owodunni, Orevaoghene Ahia, Sachin Kumar