Related papers: Efficient Sequence Packing without Cross-contamina…

Fast Thinking for Large Language Models

Reasoning-oriented Large Language Models (LLMs) often rely on generating explicit tokens step by step, and their effectiveness typically hinges on large-scale supervised fine-tuning or reinforcement learning. While Chain-of-Thought (CoT)…

Computation and Language · Computer Science 2025-09-30 Haoyu Zheng, Zhuonan Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Zheqi Lv, Juncheng Li, Siliang Tang, Yueting Zhuang, Hongyang He

Effects of padding on LSTMs and CNNs

Long Short-Term Memory (LSTM) Networks and Convolutional Neural Networks (CNN) have become very common and are used in many fields as they were effective in solving many problems where the general neural networks were inefficient. They were…

Machine Learning · Computer Science 2019-03-19 Mahidhar Dwarampudi, N V Subba Reddy

Enhancing Document-level Translation of Large Language Model via Translation Mixed-instructions

Existing large language models (LLMs) for machine translation are typically fine-tuned on sentence-level translation instructions and achieve satisfactory performance at the sentence level. However, when applied to document-level…

Computation and Language · Computer Science 2024-01-17 Yachao Li, Junhui Li, Jing Jiang, Min Zhang

Long-Context Attention Benchmark: From Kernel Efficiency to Distributed Context Parallelism

Transformer-based large language models (LLMs) have achieved remarkable success, yet their standard attention mechanism incurs quadratic computation and memory costs with respect to sequence length, posing a major bottleneck for…

Machine Learning · Computer Science 2025-10-22 Tao Bu, Qiangang Wang, Bowen Zeng, Hanwen Sun, Yunpeng Huang, Chun Cao, Jingwei Xu

InfiniPipe: Elastic Pipeline Parallelism for Efficient Variable-Length Long-Context LLM Training

Long context training is crucial for LLM's context extension. Existing schemes, such as sequence parallelism, incur substantial communication overhead. Pipeline parallelism (PP) reduces this cost, but its effectiveness hinges on…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-04-28 Shiju Wang, Yujie Wang, Ao Sun, Fangcheng Fu, Zijian Zhu, Bin Cui, Xu Han, Kaisheng Ma

Sentence-Anchored Gist Compression for Long-Context LLMs

This work investigates context compression for Large Language Models (LLMs) using learned compression tokens to reduce the memory and computational demands of processing long sequences. We demonstrate that pre-trained LLMs can be fine-tuned…

Computation and Language · Computer Science 2025-11-12 Dmitrii Tarasov, Elizaveta Goncharova, Kuznetsov Andrey

One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments

Large Language Models (LLMs) have advanced rapidly but face significant memory demands. While quantization has shown promise for LLMs, current methods typically require lengthy training to alleviate the performance degradation from…

Artificial Intelligence · Computer Science 2024-05-31 Ke Yi, Yuhui Xu, Heng Chang, Chen Tang, Yuan Meng, Tong Zhang, Jia Li

Getting the most out of your tokenizer for pre-training and domain adaptation

Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize…

Computation and Language · Computer Science 2024-02-08 Gautier Dagan, Gabriel Synnaeve, Baptiste Rozière

Language Modeling with Learned Meta-Tokens

While modern Transformer-based language models (LMs) have achieved major success in multi-task generalization, they often struggle to capture long-range dependencies within their context window. This work introduces a novel approach using…

Computation and Language · Computer Science 2025-09-23 Alok N. Shah, Khush Gupta, Keshav Ramji, Pratik Chaudhari

Simple and Effective Gradient-Based Tuning of Sequence-to-Sequence Models

Recent trends towards training ever-larger language models have substantially improved machine learning performance across linguistic tasks. However, the huge cost of training larger models can make tuning them prohibitively expensive,…

Computation and Language · Computer Science 2022-09-13 Jared Lichtarge, Chris Alberti, Shankar Kumar

Efficient Training for Cross-lingual Speech Language Models

Currently, large language models (LLMs) predominantly focus on the text modality. To enable more natural human-AI interaction, speech LLMs are emerging, but building effective end-to-end speech LLMs remains challenging due to limited data…

Computation and Language · Computer Science 2026-04-14 Yan Zhou, Qingkai Fang, Yun Hong, Yang Feng

An Empirical Study on Prompt Compression for Large Language Models

Prompt engineering enables Large Language Models (LLMs) to perform a variety of tasks. However, lengthy prompts significantly increase computational complexity and economic costs. To address this issue, we study six prompt compression…

Computation and Language · Computer Science 2025-05-02 Zheng Zhang, Jinyi Li, Yihuai Lan, Xiang Wang, Hao Wang

TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression

Large Language Models (LLMs) have recently achieved remarkable progress by leveraging Reinforcement Learning and extended Chain-of-Thought (CoT) techniques. However, the challenge of performing efficient language reasoning--especially…

Computation and Language · Computer Science 2025-06-17 Zhong-Zhi Li, Xiao Liang, Zihao Tang, Lei Ji, Peijie Wang, Haotian Xu, Xing W, Haizhen Huang, Weiwei Deng, Yeyun Gong, Zhijiang Guo, Xiao Liu, Fei Yin, Cheng-Lin Liu

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Computation in a typical Transformer-based large language model (LLM) can be characterized by batch size, hidden dimension, number of layers, and sequence length. Until now, system works for accelerating LLM training have focused on the…

Machine Learning · Computer Science 2023-10-05 Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, Yuxiong He

Towards Audio Token Compression in Large Audio Language Models

Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and…

Audio and Speech Processing · Electrical Eng. & Systems 2025-11-27 Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

Mixture Compressor for Mixture-of-Experts LLMs Gains More

Mixture-of-Experts large language models (MoE-LLMs) mark a significant step forward for language models; however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading…

Machine Learning · Computer Science 2025-02-25 Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory

Large language models (LLMs) have emerged as a cornerstone in real-world applications with lengthy streaming inputs (e.g., LLM-driven agents). However, existing LLMs, pre-trained on sequences with a restricted maximum length, cannot process…

Computation and Language · Computer Science 2024-05-29 Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun

Sparser Block-Sparse Attention via Token Permutation

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence…

Computation and Language · Computer Science 2026-05-25 Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Modern tokenizers employ deterministic algorithms to map text into a single "canonical" token sequence, yet the same string can be encoded as many non-canonical tokenizations using the tokenizer vocabulary. In this work, we investigate the…

Computation and Language · Computer Science 2026-02-04 Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

Parameter-Efficient Transformer Embeddings

Embedding layers in transformer-based NLP models typically account for the largest share of model parameters, scaling with vocabulary size but not yielding performance gains proportional to scale. We propose an alternative approach in which…

Computation and Language · Computer Science 2025-05-06 Henry Ndubuaku, Mouad Talhi