Related papers: Accelerating Attention through Gradient-Based Lear…

Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

As its core computation, a self-attention mechanism gauges pairwise correlations across the entire input sequence. Despite favorable performance, calculating pairwise correlations is prohibitively costly. While recent work has shown the…

Machine Learning · Computer Science 2022-09-02 Amir Yazdanbakhsh, Ashkan Moradifirouzabadi, Zheng Li, Mingu Kang

Focus on the Core: Efficient Attention via Pruned Token Compression for Document Classification

Transformer-based models have achieved dominant performance in numerous NLP tasks. Despite their remarkable successes, pre-trained transformers such as BERT suffer from a computationally expensive self-attention mechanism that interacts…

Computation and Language · Computer Science 2024-06-04 Jungmin Yun, Mihyeon Kim, Youngbin Kim

ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism

Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows…

Machine Learning · Computer Science 2025-07-01 Venmugil Elango

High-Layer Attention Pruning with Rescaling

Pruning is a highly effective approach for compressing large language models (LLMs), significantly reducing inference latency. However, conventional training-free structured pruning methods often employ a heuristic metric that…

Computation and Language · Computer Science 2026-01-28 Songtao Liu, Peng Liu

APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Fine-tuning and inference with large Language Models (LM) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve…

Computation and Language · Computer Science 2024-06-05 Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao

DART-ing Through the Drift: Dynamic Tracing of Knowledge Neurons for Adaptive Inference-Time Pruning

Large Language Models (LLMs) exhibit substantial parameter redundancy, particularly in Feed-Forward Networks (FFNs). Existing pruning methods suffer from two primary limitations. First, reliance on dataset-specific calibration introduces…

Computation and Language · Computer Science 2026-02-02 Abhishek Tyagi, Yunuo Cen, Shrey Dhorajiya, Bharadwaj Veeravalli, Xuanyao Fong

Treeformer: Dense Gradient Trees for Efficient Attention Computation

Standard inference and training with transformer-based architectures scale quadratically with input sequence length. This is prohibitively large for a variety of applications, especially in web-page translation, query-answering etc.…

Computation and Language · Computer Science 2023-03-20 Lovish Madaan, Srinadh Bhojanapalli, Himanshu Jain, Prateek Jain

Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach

We study the problem of fine-tuning a language model (LM) for a target task by optimally using the information from $n$ auxiliary tasks. This problem has broad applications in NLP, such as targeted instruction tuning and data selection in…

Computation and Language · Computer Science 2025-06-03 Dongyue Li, Ziniu Zhang, Lu Wang, Hongyang R. Zhang

Task-oriented Memory-efficient Pruning-Adapter

The outstanding performance and growing size of Large Language Models have led to increased attention to parameter-efficient learning. The two predominant approaches are Adapters and Pruning. Adapters are to freeze the model and give it a…

Computation and Language · Computer Science 2023-04-07 Guorun Wang, Jun Yang, Yaoru Sun

Gradient-based Intra-attention Pruning on Pre-trained Language Models

Pre-trained language models achieve superior performance but are computationally expensive. Techniques such as pruning and knowledge distillation have been developed to reduce their sizes and latencies. In this work, we propose a structured…

Computation and Language · Computer Science 2023-05-19 Ziqing Yang, Yiming Cui, Xin Yao, Shijin Wang

LEAP: Learnable Pruning for Transformer-based Models

Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models. However, current pruning algorithms either only focus on one pruning category, e.g., structured…

Computation and Language · Computer Science 2022-05-24 Zhewei Yao, Xiaoxia Wu, Linjian Ma, Sheng Shen, Kurt Keutzer, Michael W. Mahoney, Yuxiong He

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Transformer-based models have emerged as one of the most widely used architectures for natural language processing, natural language generation, and image generation. The size of the state-of-the-art models has increased steadily reaching…

Hardware Architecture · Computer Science 2025-01-15 Rya Sanovar, Srikant Bharadwaj, Renee St. Amant, Victor Rühle, Saravan Rajmohan

Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition

End-to-end models are favored in automatic speech recognition (ASR) because of their simplified system structure and superior performance. Among these models, Transformer and Conformer have achieved state-of-the-art recognition accuracy in…

Sound · Computer Science 2021-06-18 Xiong Wang, Sining Sun, Lei Xie, Long Ma

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model…

Machine Learning · Computer Science 2025-02-27 Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou

Neural Language Model Pruning for Automatic Speech Recognition

We study model pruning methods applied to Transformer-based neural network language models for automatic speech recognition. We explore three aspects of the pruning framework, namely criterion, method and scheduler, analyzing their…

Machine Learning · Computer Science 2023-10-06 Leonardo Emili, Thiago Fraga-Silva, Ernest Pusateri, Markus Nußbaum-Thom, Youssef Oualil

FineText: Text Classification via Attention-based Language Model Fine-tuning

Training deep neural networks from scratch on natural language processing (NLP) tasks requires a significant amount of manually labeled text and substantial time to converge, requirements that customers usually cannot satisfy. In this…

Computation and Language · Computer Science 2019-10-29 Yunzhe Tao, Saurabh Gupta, Satyapriya Krishna, Xiong Zhou, Orchid Majumder, Vineet Khare

LearnPruner: Rethinking Attention-based Token Pruning in Vision Language Models

Vision-Language Models (VLMs) have recently demonstrated remarkable capabilities in visual understanding and reasoning, but they also impose significant computational burdens due to long visual sequence inputs. Recent works address this…

Computer Vision and Pattern Recognition · Computer Science 2026-04-28 Rinyoichi Takezoe, Yaqian Li, Zihao Bo, Anzhou Hou, Mo Guang, Kaiwen Long

Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention

Vision transformer has emerged as a new paradigm in computer vision, showing excellent performance while accompanied by expensive computational cost. Image token pruning is one of the main approaches for ViT compression, due to the facts…

Computer Vision and Pattern Recognition · Computer Science 2023-07-07 Xiangcheng Liu, Tianyi Wu, Guodong Guo

Input-length-shortening and text generation via attention values

Identifying words that impact a task's performance more than others is a challenge in natural language processing. Transformer models have recently addressed this issue by incorporating an attention mechanism that assigns greater attention…

Computation and Language · Computer Science 2023-03-15 Neşet Özkan Tan, Alex Yuxuan Peng, Joshua Bensemann, Qiming Bao, Tim Hartill, Mark Gahegan, Michael Witbrock

Know What You Don't Need: Single-Shot Meta-Pruning for Attention Heads

Deep pre-trained Transformer models have achieved state-of-the-art results over a variety of natural language processing (NLP) tasks. By learning rich language knowledge with millions of parameters, these models are usually…

Computation and Language · Computer Science 2020-11-10 Zhengyan Zhang, Fanchao Qi, Zhiyuan Liu, Qun Liu, Maosong Sun