Related papers: Efficient Sequence Packing without Cross-contamina…

Hansel: Output Length Controlling Framework for Large Language Models

Despite the great success of large language models (LLMs), efficiently controlling the length of the output sequence still remains a challenge. In this paper, we propose Hansel, an efficient framework for length control in LLMs without…

Computation and Language · Computer Science 2024-12-19 Seoha Song , Junhyun Lee , Hyeonmok Ko

Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models

Large language models (LLMs) sometimes fail to respond appropriately to deterministic tasks -- such as counting or forming acronyms -- because the implicit prior distribution they have learned over sequences of tokens influences their…

Computation and Language · Computer Science 2025-04-18 Liyi Zhang , Veniamin Veselovsky , R. Thomas McCoy , Thomas L. Griffiths

Protum: A New Method For Prompt Tuning Based on "[MASK]"

Recently, prompt tuning \cite{lester2021power} has gradually become a new paradigm for NLP, which only depends on the representation of the words by freezing the parameters of pre-trained language models (PLMs) to obtain remarkable…

Computation and Language · Computer Science 2022-01-31 Pan He , Yuxi Chen , Yan Wang , Yanru Zhang

Boosting Distributed Training Performance of the Unpadded BERT Model

Pre-training models are an important tool in Natural Language Processing (NLP), while the BERT model is a classic pre-training model whose structure has been widely adopted by followers. It was even chosen as the reference model for the…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-08-18 Jinle Zeng , Min Li , Zhihua Wu , Jiaqi Liu , Yuang Liu , Dianhai Yu , Yanjun Ma

Retrofitting Large Language Models with Dynamic Tokenization

Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. To address this issue, we challenge the…

Computation and Language · Computer Science 2025-06-12 Darius Feher , Ivan Vulić , Benjamin Minixhofer

SPEED: Speculative Pipelined Execution for Efficient Decoding

Generative Large Language Models (LLMs) based on the Transformer architecture have recently emerged as a dominant foundation model for a wide range of Natural Language Processing tasks. Nevertheless, their application in real-time scenarios…

Computation and Language · Computer Science 2024-01-04 Coleman Hooper , Sehoon Kim , Hiva Mohammadzadeh , Hasan Genc , Kurt Keutzer , Amir Gholami , Sophia Shao

Linearization Explains Fine-Tuning in Large Language Models

Parameter-Efficient Fine-Tuning (PEFT) is a popular class of techniques that strive to adapt large models in a scalable and resource-efficient manner. Yet, the mechanisms underlying their training performance and generalization remain…

Machine Learning · Computer Science 2026-02-10 Zahra Rahimi Afzal , Tara Esmaeilbeig , Mojtaba Soltanalian , Mesrob I. Ohannessian

D4: Improving LLM Pretraining via Document De-Duplication and Diversification

Over recent years, an increasing amount of compute and data has been poured into training large language models (LLMs), usually by doing one-pass learning on as many tokens as possible randomly selected from large-scale web corpora. While…

Computation and Language · Computer Science 2023-08-24 Kushal Tirumala , Daniel Simig , Armen Aghajanyan , Ari S. Morcos

Self-Selected Attention Span for Accelerating Large Language Model Inference

Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this…

Computation and Language · Computer Science 2024-04-16 Tian Jin , Wanzin Yazar , Zifei Xu , Sayeh Sharify , Xin Wang

LongAlign: A Recipe for Long Context Alignment of Large Language Models

Extending large language models to effectively handle long contexts requires instruction fine-tuning on input sequences of similar length. To address this, we present LongAlign -- a recipe of the instruction data, training, and evaluation…

Computation and Language · Computer Science 2024-02-01 Yushi Bai , Xin Lv , Jiajie Zhang , Yuze He , Ji Qi , Lei Hou , Jie Tang , Yuxiao Dong , Juanzi Li

How Well do LLMs Compress Their Own Chain-of-Thought? A Token Complexity Approach

Chain-of-thought prompting has emerged as a powerful technique for enabling large language models (LLMs) to solve complex reasoning tasks. However, these reasoning chains can be verbose, raising concerns about efficiency. In response,…

Computation and Language · Computer Science 2025-04-02 Ayeong Lee , Ethan Che , Tianyi Peng

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on…

Computation and Language · Computer Science 2021-09-16 Vin Sachidananda , Jason S. Kessler , Yi-an Lai

A Survey of Token Compression for Efficient Multimodal Large Language Models

Multimodal large language models (MLLMs) have made remarkable strides, largely driven by their ability to process increasingly long and complex contexts, such as high-resolution images, extended video sequences, and lengthy audio input.…

Computer Vision and Pattern Recognition · Computer Science 2026-02-03 Kele Shao , Keda Tao , Kejia Zhang , Sicheng Feng , Mu Cai , Yuzhang Shang , Haoxuan You , Can Qin , Yang Sui , Huan Wang

Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification

Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts…

Computation and Language · Computer Science 2024-06-04 Jungmin Yun , Mihyeon Kim , Youngbin Kim

Towards Efficient Active Learning in NLP via Pretrained Representations

Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications. When labeled documents are scarce, active learning helps save annotation efforts but requires retraining of massive…

Machine Learning · Computer Science 2024-02-27 Artem Vysogorets , Achintya Gopal

EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models

Speculative decoding emerges as a pivotal technique for enhancing the inference speed of Large Language Models (LLMs). Despite recent research aiming to improve prediction efficiency, multi-sample speculative decoding has been overlooked…

Computation and Language · Computer Science 2024-10-15 Yunsheng Ni , Chuanjian Liu , Yehui Tang , Kai Han , Yunhe Wang

Adaptive Draft-Verification for Efficient Large Language Model Decoding

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires…

Computation and Language · Computer Science 2024-08-20 Xukun Liu , Bowen Lei , Ruqi Zhang , Dongkuan Xu

BatchPrompt: Accomplish more with less

As the ever-increasing token limits of large language models (LLMs) have enabled long context as input, prompting with single data samples might no longer an efficient way. A straightforward strategy improving efficiency is to batch data…

Computation and Language · Computer Science 2024-07-16 Jianzhe Lin , Maurice Diesendruck , Liang Du , Robin Abraham

Mini-batch Coresets for Memory-efficient Language Model Training on Data Mixtures

Training with larger mini-batches improves the convergence rate and can yield superior performance. However, training with large mini-batches becomes prohibitive for Large Language Models (LLMs), due to the large GPU memory requirement. To…

Machine Learning · Computer Science 2025-05-29 Dang Nguyen , Wenhan Yang , Rathul Anand , Yu Yang , Baharan Mirzasoleiman

Token Sequence Compression for Efficient Multimodal Computing

The exponential growth of Large Multimodal Models (LMMs) has driven advancements in cross-modal reasoning but at significant computational costs. In this work, we focus on visual language models. We highlight the redundancy and inefficiency…

Computer Vision and Pattern Recognition · Computer Science 2025-04-28 Yasmine Omri , Parth Shroff , Thierry Tambe