Related papers: FlashNorm: Fast Normalization for Transformers

MXNorm: Reusing MXFP block scales for efficient tensor normalisation

Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix…

Machine Learning · Computer Science 2026-03-16 Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi

Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers

Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and…

Machine Learning · Computer Science 2023-10-27 Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, David Z. Pan

Enjoy Your Layer Normalization with the Computational Efficiency of RMSNorm

Layer normalization (LN) is a fundamental component in modern deep learning, but its per-sample centering and scaling introduce non-negligible inference overhead. RMSNorm improves efficiency by removing the centering operation, yet this may…

Machine Learning · Computer Science 2026-05-15 Yuxin Guo, Yihao Yue, Yunhao Ni, Yizhou Ruan, Jie Luo, Wenjun Wu, Lei Huang

Root Mean Square Layer Normalization

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight…

Machine Learning · Computer Science 2019-10-17 Biao Zhang, Rico Sennrich

LLM Inference Acceleration via Efficient Operation Fusion

The rapid development of the Transformer-based Large Language Models (LLMs) in recent years has been closely linked to their ever-growing and already enormous sizes. Many LLMs contain hundreds of billions of parameters and require dedicated…

Computation and Language · Computer Science 2025-02-26 Mahsa Salmani, Ilya Soloveychik

SLaNC: Static LayerNorm Calibration

The ever increasing sizes of Large Language Models (LLMs) beyond hundreds of billions of parameters have generated enormous pressure on the manufacturers of dedicated hardware accelerators and made the innovative design of the latter one of…

Machine Learning · Computer Science 2024-10-15 Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik

FlashDecoding++: Faster Large Language Model Inference on GPUs

Large Language Models (LLMs) are becoming increasingly important in various domains. However, the following challenges still remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation…

Machine Learning · Computer Science 2024-01-08 Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices

In Transformer models, non-GEMM (non-General Matrix Multiplication) operations -- especially Softmax and Layer Normalization (LayerNorm) -- often dominate hardware cost due to their nonlinear nature. To address this, previous approximation…

Hardware Architecture · Computer Science 2026-04-28 Dawon Choi, Hana Kim, Ji-Hoon Kim

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-20 Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

LightNorm: Area and Energy-Efficient Batch Normalization Hardware for On-Device DNN Training

When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden…

Hardware Architecture · Computer Science 2022-11-08 Seock-Hwan Noh, Junsang Park, Dahoon Park, Jahyun Koo, Jeik Choi, Jaeha Kung

TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer

We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear…

Computation and Language · Computer Science 2024-01-22 Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, Yiran Zhong

FlashMem: Supporting Modern DNN Workloads on Mobile with GPU Memory Hierarchy Optimizations

The increasing size and complexity of modern deep neural networks (DNNs) pose significant challenges for on-device inference on mobile GPUs, with limited memory and computational resources. Existing DNN acceleration frameworks primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-18 Zhihao Shu, Md Musfiqur Rahman Sanim, Hangyu Zheng, Kunxiong Zhu, Miao Yin, Gagan Agrawal, Wei Niu

IterL2Norm: Fast Iterative L2-Normalization

Transformer-based large language models are memory-bound models whose operation is based on a large amount of data that is marginally reused. Thus, the data movement between a host and accelerator likely dictates the total wall-clock…

Machine Learning · Computer Science 2025-01-20 ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong

Scalable MatMul-free Language Modeling

Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to the reliance on matrix multiplication (MatMul)…

Computation and Language · Computer Science 2025-07-29 Rui-Jie Zhu, Yu Zhang, Steven Abreu, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Sumit Bam Shrestha, Peng Zhou, Jason K. Eshraghian

The Geometric Cost of Normalization: Affine Bounds on the Bayesian Complexity of Neural Networks

LayerNorm and RMSNorm impose fundamentally different geometric constraints on their outputs - and this difference has a precise, quantifiable consequence for model complexity. We prove that LayerNorm's mean-centering step, by confining data…

Machine Learning · Computer Science 2026-03-31 Sungbae Chun

MemoryFormer: Minimize Transformer Computation by Removing Fully-Connected Layers

In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and…

Computation and Language · Computer Science 2026-02-04 Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang

FlashRNN: I/O-Aware Optimization of Traditional RNNs on modern hardware

While Transformers and other sequence-parallelizable neural network architectures seem like the current state of the art in sequence modeling, they specifically lack state-tracking capabilities. These are important for time-series tasks and…

Machine Learning · Computer Science 2025-03-14 Korbinian Pöppel, Maximilian Beck, Sepp Hochreiter

FlashSampling: Fast and Memory-Efficient Exact Sampling

Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that…

Machine Learning · Computer Science 2026-05-14 Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang

FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition

The recently proposed Conformer architecture which combines convolution with attention to capture both local and global dependencies has become the de facto backbone model for Automatic Speech Recognition (ASR). Inherited from the…

Sound · Computer Science 2022-11-01 Xingchen Song, Di Wu, Binbin Zhang, Zhiyong Wu, Wenpeng Li, Dongfang Li, Pengshen Zhang, Zhendong Peng, Fuping Pan, Changbao Zhu, Zhongqin Wu

Understanding and Improving Layer Normalization

Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness…

Machine Learning · Computer Science 2019-11-19 Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin