Related papers: Latency Adjustable Transformer Encoder for Languag…

Enhancing Latent Computation in Transformers with Latent Tokens

Augmenting large language models (LLMs) with auxiliary tokens has emerged as a promising strategy for enhancing model performance. In this work, we introduce a lightweight method termed latent tokens; these are dummy tokens that may be…

Machine Learning · Computer Science 2025-05-20 Yuchang Sun, Yanxi Chen, Yaliang Li, Bolin Ding

Large Language Model Partitioning for Low-Latency Inference at the Edge

Large Language Models (LLMs) based on autoregressive, decoder-only Transformers generate text one token at a time, where a token represents a discrete unit of text. As each newly produced token is appended to the partial output sequence,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-05-06 Dimitrios Kafetzis, Ramin Khalili, Iordanis Koutsopoulos

A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models

Finetuning language models (LMs) is crucial for adapting the models to downstream data and tasks. However, full finetuning is usually costly. Existing work, such as parameter-efficient finetuning (PEFT), often focuses on how to…

Computation and Language · Computer Science 2025-06-03 Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang

Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs

Large language models (LLMs) power many state-of-the-art systems in natural language processing. However, these models are extremely computationally expensive, even at inference time, raising the natural question: when is the extra cost of…

Machine Learning · Computer Science 2023-05-05 Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, Percy Liang

LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

Transformer encoders contextualize token representations by attending to all other tokens at each layer, leading to quadratic increase in compute effort with the input length. In practice, however, the input text of many NLP tasks can be…

Computation and Language · Computer Science 2023-06-01 Jeremiah Milbauer, Annie Louis, Mohammad Javad Hosseini, Alex Fabrikant, Donald Metzler, Tal Schuster

Enhancing Inference Efficiency of Large Language Models: Investigating Optimization Strategies and Architectural Innovations

Large Language Models are growing in size, and we expect them to continue to do so, as larger models train quicker. However, this increase in size will severely impact inference costs. Therefore model compression is important, to retain the…

Machine Learning · Computer Science 2024-04-10 Georgy Tyukin

Fast-FNet: Accelerating Transformer Encoder Models via Efficient Fourier Layers

Transformer-based language models utilize the attention mechanism for substantial performance improvements in almost all natural language processing (NLP) tasks. Similar attention structures are also extensively studied in several other…

Computation and Language · Computer Science 2023-05-17 Nurullah Sevim, Ege Ozan Özyedek, Furkan Şahinuç, Aykut Koç

Adaptive Large Language Models By Layerwise Attention Shortcuts

Transformer architectures are the backbone of the modern AI revolution. However, they are based on simply stacking the same blocks in dozens of layers and processing information sequentially from one block to another. In this paper, we…

Computation and Language · Computer Science 2024-12-24 Prateek Verma, Mert Pilanci

Inference Optimization of Foundation Models on AI Accelerators

Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI across various industries. Industry and research community have witnessed a large number of new…

Artificial Intelligence · Computer Science 2024-10-02 Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis

Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning

This paper introduces an efficient strategy to transform Large Language Models (LLMs) into Multi-Modal Large Language Models (MLLMs). By conceptualizing this transformation as a domain adaptation process, i.e., transitioning from text…

Computation and Language · Computer Science 2023-12-19 Bingchen Zhao, Haoqin Tu, Chen Wei, Jieru Mei, Cihang Xie

APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference

Fine-tuning and inference with large Language Models (LM) are generally known to be expensive. Parameter-efficient fine-tuning over pretrained LMs reduces training memory by updating a small number of LM parameters but does not improve…

Computation and Language · Computer Science 2024-06-05 Bowen Zhao, Hannaneh Hajishirzi, Qingqing Cao

Communication Compression for Tensor Parallel LLM Inference

Large Language Models (LLMs) have pushed the frontier of artificial intelligence but are comprised of hundreds of billions of parameters and operations. For faster inference latency, LLMs are deployed on multiple hardware accelerators…

Machine Learning · Computer Science 2026-01-07 Jan Hansen-Palmus, Michael Truong Le, Oliver Hausdörfer, Alok Verma

FlashEVA: Accelerating LLM inference via Efficient Attention

Transformer models have revolutionized natural language processing, achieving state-of-the-art performance and demonstrating remarkable scalability. However, their memory demands, particularly due to maintaining full context in memory, pose…

Computation and Language · Computer Science 2025-11-04 Juan Gabriel Kostelec, Qinghai Guo

Eliciting Fine-Tuned Transformer Capabilities via Inference-Time Techniques

Large language models have transformed natural language processing, yet supervised fine-tuning (SFT) remains computationally intensive. This paper formally proves that capabilities acquired through SFT can be approximated by a base…

Machine Learning · Computer Science 2025-06-11 Asankhaya Sharma

Adapting Pretrained Transformer to Lattices for Spoken Language Understanding

Lattices are compact representations that encode multiple hypotheses, such as speech recognition results or different word segmentations. It is shown that encoding lattices as opposed to 1-best results generated by automatic speech…

Computation and Language · Computer Science 2020-11-03 Chao-Wei Huang, Yun-Nung Chen

Consistent Accelerated Inference via Confident Adaptive Transformers

We develop a novel approach for confidently accelerating inference in the large and expensive multilayer Transformers that are now ubiquitous in natural language processing (NLP). Amortized or approximate computational methods increase…

Computation and Language · Computer Science 2021-09-10 Tal Schuster, Adam Fisch, Tommi Jaakkola, Regina Barzilay

Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference

While transformer models have been highly successful, they are computationally inefficient. We observe that for each layer, the full width of the layer may be needed only for a small subset of tokens inside a batch and that the "effective"…

Machine Learning · Computer Science 2024-12-19 Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini, Simone Scardapane

Minimum Latency Training of Sequence Transducers for Streaming End-to-End Speech Recognition

Sequence transducers, such as the RNN-T and the Conformer-T, are one of the most promising models of end-to-end speech recognition, especially in streaming scenarios where both latency and accuracy are important. Although various methods,…

Audio and Speech Processing · Electrical Eng. & Systems 2022-11-07 Yusuke Shinohara, Shinji Watanabe

Quantum Transformer: Accelerating model inference via quantum linear algebra

Powerful generative artificial intelligence from large language models (LLMs) harnesses extensive computational resources for inference. In this work, we investigate the transformer architecture, a key component of these models, under the…

Quantum Physics · Physics 2025-10-30 Naixu Guo, Zhan Yu, Matthew Choi, Yizhan Han, Aman Agrawal, Kouhei Nakaji, Alán Aspuru-Guzik, Patrick Rebentrost

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large…

Computation and Language · Computer Science 2026-05-19 Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin