Related papers: Efficient Sequence Packing without Cross-contamina…

Improving BERT Fine-tuning with Embedding Normalization

Large pre-trained sentence encoders like BERT start a new chapter in natural language processing. A common practice to apply pre-trained BERT to sequence classification tasks (e.g., classification of sentences or sentence pairs) is by…

Computation and Language · Computer Science · 2020-02-26 · Wenxuan Zhou, Junyi Du, Xiang Ren

Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale

In recent years, Large Language Models (LLMs) have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In…

Computation and Language · Computer Science · 2024-10-03 · Wenzhen Zheng, Wenbo Pan, Xu Xu, Libo Qin, Li Yue, Ming Zhou

Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap

Optimizing training performance in large language models (LLMs) remains an essential challenge, particularly in improving model performance while maintaining computational costs. This work challenges the conventional approach of training…

Computation and Language · Computer Science · 2025-11-04 · Chun-Hao Yang, Bo-Han Feng, Tzu-Yuan Lai, Yan Yu Chen, Yin-Kai Dean Huang, Shou-De Lin

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline

Large language models (LLMs) have revolutionized the field of AI, demonstrating unprecedented capacity across various tasks. However, the inference process for LLMs comes with significant computational costs. In this paper, we propose an…

Computation and Language · Computer Science · 2023-05-30 · Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, Yang You

BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences

Effective attention modules have played a crucial role in the success of Transformer-based large language models (LLMs), but the quadratic time and memory complexities of these attention modules also pose a challenge when processing long…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-06-07 · Ao Sun, Weilin Zhao, Xu Han, Cheng Yang, Zhiyuan Liu, Chuan Shi, Maosong Sun

Efficient Sequential Decision Making with Large Language Models

This paper focuses on extending the success of large language models (LLMs) to sequential decision making. Existing efforts either (i) re-train or finetune LLMs for decision making, or (ii) design prompts for pretrained LLMs. The former…

Machine Learning · Computer Science · 2025-06-17 · Dingyang Chen, Qi Zhang, Yinglun Zhu

Improving Continual Pre-training Through Seamless Data Packing

Continual pre-training has demonstrated significant potential in enhancing model performance, particularly in domain-specific scenarios. The most common approach for packing data before continual pre-training involves concatenating input…

Computation and Language · Computer Science · 2025-05-30 · Ruicheng Yin, Xuan Gao, Changze Lv, Xiaohua Wang, Xiaoqing Zheng, Xuanjing Huang

Making Large Language Models Better Data Creators

Although large language models (LLMs) have advanced the state-of-the-art in NLP significantly, deploying them for downstream applications is still challenging due to cost, responsiveness, control, or concerns around privacy and security. As…

Computation and Language · Computer Science · 2023-11-01 · Dong-Ho Lee, Jay Pujara, Mohit Sewak, Ryen W. White, Sujay Kumar Jauhar

Asynchronous Training of Word Embeddings for Large Text Corpora

Word embeddings are a powerful approach for analyzing language and have been widely popular in numerous tasks in information retrieval and text mining. Training embeddings over huge corpora is computationally expensive because the input is…

Machine Learning · Computer Science · 2018-12-11 · Avishek Anand, Megha Khosla, Jaspreet Singh, Jan-Hendrik Zab, Zijian Zhang

Span Fine-tuning for Pre-trained Language Models

Pre-trained language models (PrLM) have to carefully manage input units when training on a very large text with a vocabulary consisting of millions of words. Previous works have shown that incorporating span-level information over…

Computation and Language · Computer Science · 2021-09-16 · Rongzhou Bao, Zhuosheng Zhang, Hai Zhao

Mask More and Mask Later: Efficient Pre-training of Masked Language Models by Disentangling the [MASK] Token

The pre-training of masked language models (MLMs) consumes massive computation to achieve good results on downstream NLP tasks, resulting in a large carbon footprint. In the vanilla MLM, the virtual tokens, [MASK]s, act as placeholders and…

Computation and Language · Computer Science · 2022-11-16 · Baohao Liao, David Thulke, Sanjika Hewavitharana, Hermann Ney, Christof Monz

Enhancing Training Efficiency Using Packing with Flash Attention

Padding is often used in tuning LLM models by adding special tokens to shorter training examples to match the length of the longest sequence in each batch. While this ensures uniformity for batch processing, it introduces inefficiencies by…

Machine Learning · Computer Science · 2024-09-04 · Achintya Kundu, Rhui Dih Lee, Laura Wynter, Raghu Kiran Ganti, Mayank Mishra

Breaking the Token Barrier: Chunking and Convolution for Efficient Long Text Classification with BERT

Transformer-based models, specifically BERT, have propelled research in various NLP tasks. However, these models are limited to a maximum token limit of 512 tokens. Consequently, this makes it non-trivial to apply it in a practical setting…

Computation and Language · Computer Science · 2023-11-01 · Aman Jaiswal, Evangelos Milios

PaceLLM: Brain-Inspired Large Language Models for Long-Context Understanding

While Large Language Models (LLMs) demonstrate strong performance across domains, their long-context capabilities are limited by transient neural activations causing information decay and unstructured feed-forward network (FFN) weights…

Neurons and Cognition · Quantitative Biology · 2026-04-13 · Kangcong Li, Peng Ye, Chongjun Tu, Lin Zhang, Chunfeng Song, Jiamin Wu, Tao Yang, Qihao Zheng, Tao Chen

Learning to Parallel: Accelerating Diffusion Large Language Models via Learnable Parallel Decoding

Autoregressive decoding in large language models (LLMs) requires $\mathcal{O}(n)$ sequential steps for $n$ tokens, fundamentally limiting inference throughput. Recent diffusion-based LLMs (dLLMs) enable parallel token generation through…

Computation and Language · Computer Science · 2025-10-06 · Wenrui Bao, Zhiben Chen, Dan Xu, Yuzhang Shang

SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM

While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass…

Computation and Language · Computer Science · 2024-08-26 · Quandong Wang, Yuxuan Yuan, Xiaoyu Yang, Ruike Zhang, Kang Zhao, Wei Liu, Jian Luan, Daniel Povey, Bin Wang

Robust Transfer Learning with Pretrained Language Models through Adapters

Transfer learning with large pretrained transformer-based language models like BERT has become a dominating approach for most NLP tasks. Simply fine-tuning those large language models on downstream tasks or combining it with task-specific…

Computation and Language · Computer Science · 2021-08-06 · Wenjuan Han, Bo Pang, Yingnian Wu

Achieving Peak Performance for Large Language Models: A Systematic Review

In recent years, large language models (LLMs) have achieved remarkable success in natural language processing (NLP). LLMs require an extreme amount of parameters to attain high performance. As models grow into the trillion-parameter range,…

Computation and Language · Computer Science · 2024-09-10 · Zhyar Rzgar K Rostam, Sándor Szénási, Gábor Kertész

Probabilistic Token Alignment for Large Language Model Fusion

Training large language models (LLMs) from scratch can yield models with unique functionalities and strengths, but it is costly and often leads to redundant capabilities. A more cost-effective alternative is to fuse existing pre-trained…

Computation and Language · Computer Science · 2025-09-23 · Runjia Zeng, James Chenhao Liang, Cheng Han, Zhiwen Cao, Jiahao Liu, Xiaojun Quan, Yingjie Victor Chen, Lifu Huang, Tong Geng, Qifan Wang, Dongfang Liu

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models

Large language models (LLMs) have been applied in various applications due to their astonishing capabilities. With advancements in technologies such as chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed to LLMs…

Computation and Language · Computer Science · 2023-12-07 · Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu