Related papers: DiJiang: Efficient Large Language Models through C…

Linear Self-Attention Approximation via Trainable Feedforward Kernel

In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches -- models attaining sub-quadratic attention complexity can utilize a notion of sparsity or a low-rank approximation of inputs to reduce…

Machine Learning · Computer Science 2022-11-09 Uladzislau Yorsh, Alexander Kovalenko

SiLQ: Simple Large Language Model Quantization-Aware Training

Large language models can be quantized to reduce inference time latency, model size, and energy consumption, thereby delivering a better user experience at lower cost. A challenge exists to deliver quantized models with minimal loss of…

Machine Learning · Computer Science 2025-07-24 Steven K. Esser, Jeffrey L. McKinstry, Deepika Bablani, Rathinakumar Appuswamy, Dharmendra S. Modha

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when…

Computer Vision and Pattern Recognition · Computer Science 2024-12-23 Songhua Liu, Zhenxiong Tan, Xinchao Wang

MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers

Pre-trained language models (e.g., BERT (Devlin et al., 2018) and its variants) have achieved remarkable success in varieties of NLP tasks. However, these models usually consist of hundreds of millions of parameters which brings challenges…

Computation and Language · Computer Science 2020-04-07 Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, Ming Zhou

Efficient Heterogeneous Large Language Model Decoding with Model-Attention Disaggregation

Transformer-based large language models (LLMs) exhibit impressive performance in generative tasks but also introduce significant challenges in real-world serving due to inefficient use of the expensive, computation-optimized accelerators.…

Machine Learning · Computer Science 2025-04-11 Shaoyuan Chen, Wencong Xiao, Yutong Lin, Mingxing Zhang, Yingdi Shan, Jinlei Jiang, Kang Chen, Yongwei Wu

Quantization-Aware and Tensor-Compressed Training of Transformers for Natural Language Understanding

Fine-tuned transformer models have shown superior performances in many natural language tasks. However, the large model size prohibits deploying high-performance transformer models on resource-constrained devices. This paper proposes a…

Computation and Language · Computer Science 2024-10-01 Zi Yang, Samridhi Choudhary, Siegfried Kunzmann, Zheng Zhang

Exploring Quantization for Efficient Pre-Training of Transformer Language Models

The increasing scale of Transformer models has led to an increase in their pre-training computational requirements. While quantization has proven to be effective after pre-training and during fine-tuning, applying quantization in…

Machine Learning · Computer Science 2024-10-14 Kamran Chitsaz, Quentin Fournier, Gonçalo Mordido, Sarath Chandar

Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings

Knowledge distillation offers a transformative pathway to developing powerful, yet efficient, small language models (SLMs) suitable for resource-constrained environments. In this paper, we benchmark the performance and computational cost of…

Computation and Language · Computer Science 2026-02-25 Sachin Gopal Wani, Eric Page, Ajay Dholakia, David Ellison

Bridging the Gap for Tokenizer-Free Language Models

Purely character-based language models (LMs) have been lagging in quality on large scale datasets, and current state-of-the-art LMs rely on word tokenization. It has been assumed that injecting the prior knowledge of a tokenizer into the…

Computation and Language · Computer Science 2019-08-28 Dokook Choe, Rami Al-Rfou, Mandy Guo, Heeyoung Lee, Noah Constant

LLM-QAT: Data-Free Quantization Aware Training for Large Language Models

Several post-training quantization methods have been applied to large language models (LLMs), and have been shown to perform well down to 8-bits. We find that these methods break down at lower bit precision, and investigate quantization…

Computation and Language · Computer Science 2023-05-30 Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra

FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation

This work presents a Fully BInarized Large Language Model (FBI-LLM), demonstrating for the first time how to train a large-scale binary language model from scratch (not the partial binary or ternary LLM like BitNet b1.58) to match the…

Computation and Language · Computer Science 2024-07-10 Liqun Ma, Mingjie Sun, Zhiqiang Shen

Scavenging Hyena: Distilling Transformers into Long Convolution Models

The rapid evolution of Large Language Models (LLMs), epitomized by architectures like GPT-4, has reshaped the landscape of natural language processing. This paper introduces a pioneering approach to address the efficiency concerns…

Computation and Language · Computer Science 2024-02-01 Tokiniaina Raharison Ralambomihanta, Shahrad Mohammadzadeh, Mohammad Sami Nur Islam, Wassim Jabbour, Laurence Liang

BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation

The upscaling of Large Language Models (LLMs) has yielded impressive advances in natural language processing, yet it also poses significant deployment challenges. Weight quantization has emerged as a widely embraced solution to reduce…

Computation and Language · Computer Science 2024-02-19 Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu

LAWCAT: Efficient Distillation from Quadratic to Linear Attention with Convolution across Tokens for Long Context Modeling

Although transformer architectures have achieved state-of-the-art performance across diverse domains, their quadratic computational complexity with respect to sequence length remains a significant bottleneck, particularly for…

Computation and Language · Computer Science 2025-11-05 Zeyu Liu, Souvik Kundu, Lianghao Jiang, Anni Li, Srikanth Ronanki, Sravan Bodapati, Gourav Datta, Peter A. Beerel

Kernel Transform Learning

This work proposes kernel transform learning. The idea of dictionary learning is well known; it is a synthesis formulation where a basis is learnt along with the coefficients so as to generate or synthesize the data. Transform learning is…

Computer Vision and Pattern Recognition · Computer Science 2020-08-10 Jyoti Maggu, Angshul Majumdar

LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation

Deploying large language models (LLMs) in resource-constrained environments is hindered by heavy computational and memory requirements. We present LBLLM, a lightweight binarization framework that achieves effective W(1+1)A4 quantization…

Machine Learning · Computer Science 2026-04-22 Siqing Song, Chuang Wang, Yong Lang, Yi Yang, Xu-Yao Zhang

Transformer Dissection: A Unified Understanding of Transformer's Attention via the Lens of Kernel

Transformer is a powerful architecture that achieves superior performance on various sequence learning tasks, including neural machine translation, language understanding, and sequence prediction. At the core of the Transformer is the…

Machine Learning · Computer Science 2019-11-13 Yao-Hung Hubert Tsai, Shaojie Bai, Makoto Yamada, Louis-Philippe Morency, Ruslan Salakhutdinov

YuLan-Mini: An Open Data-efficient Language Model

Effective pre-training of large language models (LLMs) has been challenging due to the immense resource demands and the complexity of the technical processes involved. This paper presents a detailed technical report on YuLan-Mini, a highly…

Computation and Language · Computer Science 2024-12-25 Yiwen Hu, Huatong Song, Jia Deng, Jiapeng Wang, Jie Chen, Kun Zhou, Yutao Zhu, Jinhao Jiang, Zican Dong, Wayne Xin Zhao, Ji-Rong Wen

Lillama: Large Language Models Compression via Low-Rank Feature Distillation

Current LLM structured pruning methods typically involve two steps: (1) compression with calibration data and (2) costly continued pretraining on billions of tokens to recover lost performance. This second step is necessary as the first…

Machine Learning · Computer Science 2024-12-31 Yaya Sy, Christophe Cerisara, Irina Illina

CDLM: Consistency Diffusion Language Models For Faster Sampling

Diffusion Language Models (DLMs) offer a promising parallel generation paradigm but suffer from slow inference due to numerous refinement steps and the inability to use standard KV caching. We introduce CDLM (Consistency Diffusion Language…

Machine Learning · Computer Science 2026-02-23 Minseo Kim, Chenfeng Xu, Coleman Hooper, Harman Singh, Ben Athiwaratkun, Ce Zhang, Kurt Keutzer, Amir Gholami