Related papers: FlashNorm: Fast Normalization for Transformers

200 papers

Matrix multiplication performance has long been the major bottleneck to scaling deep learning workloads, which has stimulated the design of new accelerators that use increasingly low-precision number formats. However, improvements in matrix…

Machine Learning · Computer Science 2026-03-16 Callum McLean, Luke Y. Prince, Alexandre Payot, Paul Balança, Carlo Luschi

Transformers have achieved great success in machine learning applications. Normalization techniques, such as Layer Normalization (LayerNorm, LN) and Root Mean Square Normalization (RMSNorm), play a critical role in accelerating and…

Machine Learning · Computer Science 2023-10-27 Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, David Z. Pan

Layer normalization (LN) is a fundamental component in modern deep learning, but its per-sample centering and scaling introduce non-negligible inference overhead. RMSNorm improves efficiency by removing the centering operation, yet this may…

Machine Learning · Computer Science 2026-05-15 Yuxin Guo, Yihao Yue, Yunhao Ni, Yizhou Ruan, Jie Luo, Wenjun Wu, Lei Huang

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight…

Machine Learning · Computer Science 2019-10-17 Biao Zhang, Rico Sennrich

The rapid development of the Transformer-based Large Language Models (LLMs) in recent years has been closely linked to their ever-growing and already enormous sizes. Many LLMs contain hundreds of billions of parameters and require dedicated…

Computation and Language · Computer Science 2025-02-26 Mahsa Salmani, Ilya Soloveychik

The ever increasing sizes of Large Language Models (LLMs) beyond hundreds of billions of parameters have generated enormous pressure on the manufacturers of dedicated hardware accelerators and made the innovative design of the latter one of…

Machine Learning · Computer Science 2024-10-15 Mahsa Salmani, Nikita Trukhanov, Ilya Soloveychik

The Large Language Model (LLM) is becoming increasingly important in various domains. However, the following challenges remain unsolved in accelerating LLM inference: (1) Synchronized partial softmax update. The softmax operation…

Machine Learning · Computer Science 2024-01-08 Ke Hong, Guohao Dai, Jiaming Xu, Qiuli Mao, Xiuhong Li, Jun Liu, Kangdi Chen, Yuhan Dong, Yu Wang

In Transformer models, non-GEMM (non-General Matrix Multiplication) operations -- especially Softmax and Layer Normalization (LayerNorm) -- often dominate hardware cost due to their nonlinear nature. To address this, previous approximation…

Hardware Architecture · Computer Science 2026-04-28 Dawon Choi, Hana Kim, Ji-Hoon Kim

With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-09-20 Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupied most of the execution time. Accordingly, extensive research has been done to reduce the computational burden…

Hardware Architecture · Computer Science 2022-11-08 Seock-Hwan Noh, Junsang Park, Dahoon Park, Jahyun Koo, Jeik Choi, Jaeha Kung

We present TransNormerLLM, the first linear attention-based Large Language Model (LLM) that outperforms conventional softmax attention-based models in terms of both accuracy and efficiency. TransNormerLLM evolves from the previous linear…

Computation and Language · Computer Science 2024-01-22 Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, Yiran Zhong

The increasing size and complexity of modern deep neural networks (DNNs) pose significant challenges for on-device inference on mobile GPUs, with limited memory and computational resources. Existing DNN acceleration frameworks primarily…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-02-18 Zhihao Shu, Md Musfiqur Rahman Sanim, Hangyu Zheng, Kunxiong Zhu, Miao Yin, Gagan Agrawal, Wei Niu

Transformer-based large language models are memory-bound models whose operation relies on a large amount of data that is only marginally reused. Thus, the data movement between a host and accelerator likely dictates the total wall-clock…

Machine Learning · Computer Science 2025-01-20 ChangMin Ye, Yonguk Sim, Youngchae Kim, SeongMin Jin, Doo Seok Jeong

Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to the reliance on matrix multiplication (MatMul)…

LayerNorm and RMSNorm impose fundamentally different geometric constraints on their outputs - and this difference has a precise, quantifiable consequence for model complexity. We prove that LayerNorm's mean-centering step, by confining data…

Machine Learning · Computer Science 2026-03-31 Sungbae Chun

In order to reduce the computational complexity of large language models, great efforts have been made to improve the efficiency of transformer models such as linear attention and flash-attention. However, the model size and…

Computation and Language · Computer Science 2026-02-04 Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, Yunhe Wang

While Transformers and other sequence-parallelizable neural network architectures seem like the current state of the art in sequence modeling, they specifically lack state-tracking capabilities. These are important for time-series tasks and…

Machine Learning · Computer Science 2025-03-14 Korbinian Pöppel, Maximilian Beck, Sepp Hochreiter

Sampling from a categorical distribution is mathematically simple, but in large-vocabulary decoding, it often triggers extra memory traffic and extra kernels after the LM head. We present FlashSampling, an exact sampling primitive that…

Machine Learning · Computer Science 2026-05-14 Tomas Ruiz, Zhen Qin, Yifan Zhang, Xuyang Shen, Yiran Zhong, Mengdi Wang

The recently proposed Conformer architecture which combines convolution with attention to capture both local and global dependencies has become the de facto backbone model for Automatic Speech Recognition (ASR). Inherited from the…

Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness…

Machine Learning · Computer Science 2019-11-19 Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin