Related papers: Alternating Updates for Efficient Transformers

Tricks and Plug-ins for Gradient Boosting with Transformers

Transformer architectures dominate modern NLP but often demand heavy computational resources and intricate hyperparameter tuning. To mitigate these challenges, we propose a novel framework, BoostTransformer, that augments transformers with…

Machine Learning · Computer Science 2025-11-04 Biyi Fang, Truong Vo, Jean Utke, Diego Klabjan

The Augmentation-Speed Tradeoff for Consistent Network Updates

Emerging software-defined networking technologies enable more adaptive communication infrastructures, allowing for quick reactions to changes in networking requirements by exploiting the workload's temporal structure. However, operating…

Networking and Internet Architecture · Computer Science 2022-11-08 Monika Henzinger, Ami Paz, Arash Pourdamghani, Stefan Schmid

Alternating Differentiation for Optimization Layers

The idea of embedding optimization problems into deep neural networks as optimization layers to encode constraints and inductive priors has taken hold in recent years. Most existing methods focus on implicitly differentiating…

Machine Learning · Computer Science 2023-04-25 Haixiang Sun, Ye Shi, Jingya Wang, Hoang Duong Tuan, H. Vincent Poor, Dacheng Tao

Latency Adjustable Transformer Encoder for Language Understanding

Adjusting the latency, power, and accuracy of natural language understanding models is a desirable objective of an efficient architecture. This paper proposes an efficient Transformer architecture that adjusts the inference computational…

Computation and Language · Computer Science 2024-09-20 Sajjad Kachuee, Mohammad Sharifkhani

Atleus: Accelerating Transformers on the Edge Enabled by 3D Heterogeneous Manycore Architectures

Transformer architectures have become the standard neural network model for various machine learning applications including natural language processing and computer vision. However, the compute and memory requirements introduced by…

Hardware Architecture · Computer Science 2025-01-17 Pratyush Dhingra, Janardhan Rao Doppa, Partha Pratim Pande

Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks

Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels. It has shown strong effectiveness in image classification by interpolating images at the pixel level. Inspired by this…

Computation and Language · Computer Science 2020-11-12 Lichao Sun, Congying Xia, Wenpeng Yin, Tingting Liang, Philip S. Yu, Lifang He

AltUB: Alternating Training Method to Update Base Distribution of Normalizing Flow for Anomaly Detection

Unsupervised anomaly detection is coming into the spotlight these days in various practical domains due to the limited amount of anomaly data. One of the major approaches for it is a normalizing flow which pursues the invertible…

Machine Learning · Computer Science 2022-10-28 Yeongmin Kim, Huiwon Jang, DongKeon Lee, Ho-Jin Choi

Enhancing Latent Computation in Transformers with Latent Tokens

Augmenting large language models (LLMs) with auxiliary tokens has emerged as a promising strategy for enhancing model performance. In this work, we introduce a lightweight method termed latent tokens; these are dummy tokens that may be…

Machine Learning · Computer Science 2025-05-20 Yuchang Sun, Yanxi Chen, Yaliang Li, Bolin Ding

AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at…

Artificial Intelligence · Computer Science 2026-03-13 Qiyang Li, Rui Kong, Yuchen Li, Hengyi Cai, Shuaiqiang Wang, Linghe Kong, Guihai Chen, Dawei Yin

The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer…

Machine Learning · Computer Science 2026-04-24 Costin-Andrei Oncescu, Depen Morwani, Samy Jelassi, Alexandru Meterez, Mujin Kwun, Sham Kakade

Alignment Adapter to Improve the Performance of Compressed Deep Learning Models

Compressed Deep Learning (DL) models are essential for deployment in resource-constrained environments. But their performance often lags behind their large-scale counterparts. To bridge this gap, we propose Alignment Adapter (AlAd): a…

Machine Learning · Computer Science 2026-02-17 Rohit Raj Rai, Abhishek Dhaka, Amit Awekar

Speculate Deep and Accurate: Lossless and Training-Free Acceleration for Offloaded LLMs via Substitute Speculative Decoding

The immense model sizes of large language models (LLMs) challenge deployment on memory-limited consumer GPUs. Although model compression and parameter offloading are common strategies to address memory limitations, compression can degrade…

Computation and Language · Computer Science 2025-10-10 Pei-Shuo Wang, Jian-Jia Chen, Chun-Che Yang, Chi-Chih Chang, Ning-Chi Huang, Mohamed S. Abdelfattah, Kai-Chiang Wu

TOAST: Transformer Optimization using Adaptive and Simple Transformations

Foundation models achieve state-of-the-art performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining…

Machine Learning · Computer Science 2026-05-19 Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt

ASPD: Unlocking Adaptive Serial-Parallel Decoding by Exploring Intrinsic Parallelism in LLMs

The increasing scale and complexity of large language models (LLMs) pose significant inference latency challenges, primarily due to their autoregressive decoding paradigm characterized by the sequential nature of next-token prediction. By…

Computation and Language · Computer Science 2025-08-15 Keyu Chen, Zhifeng Shen, Daohai Yu, Haoqian Wu, Wei Wen, Jianfeng He, Ruizhi Qiao, Xing Sun

inversedMixup: Data Augmentation via Inverting Mixed Embeddings

Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates in the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based…

Computation and Language · Computer Science 2026-02-09 Fanshuang Kong, Richong Zhang, Qiyu Sun, Zhijie Nie, Ting Deng, Chunming Hu

Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform…

Machine Learning · Computer Science 2025-01-16 Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim

Parallel Loop Transformer for Efficient Test-Time Computation Scaling

Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or "loops." However, this…

Computation and Language · Computer Science 2025-10-30 Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, Xingyan Bin

Easy and Efficient Transformer : Scalable Inference Solution For large NLP model

Recently, large-scale transformer-based models have been proven to be effective over various tasks across many domains. Nevertheless, applying them in industrial production requires tedious and heavy works to reduce inference costs. To fill…

Computation and Language · Computer Science 2022-05-25 Gongzheng Li, Yadong Xi, Jingzhen Ding, Duan Wang, Bai Liu, Changjie Fan, Xiaoxi Mao, Zeng Zhao

TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers

Mixup is a commonly adopted data augmentation technique for image classification. Recent advances in mixup methods primarily focus on mixing based on saliency. However, many saliency detectors require intense computation and are especially…

Computer Vision and Pattern Recognition · Computer Science 2022-10-17 Hyeong Kyu Choi, Joonmyung Choi, Hyunwoo J. Kim

Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks

State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing…

Computation and Language · Computer Science 2021-06-09 Rabeeh Karimi Mahabadi, Sebastian Ruder, Mostafa Dehghani, James Henderson