Related papers: Key-Value Transformer

Exploring the Integration of Key-Value Attention Into Pure and Hybrid Transformers for Semantic Segmentation

While CNNs were long considered state of the art for image processing, the introduction of Transformer architectures has challenged this position. While achieving excellent results in image classification and segmentation, Transformers…

Computer Vision and Pattern Recognition · Computer Science 2025-03-25 DeShin Hwa, Tobias Holmes, Klaus Drechsler

A Survey of Visual Transformers

Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by such significant achievements, some pioneering works have recently been done on employing…

Computer Vision and Pattern Recognition · Computer Science 2022-12-07 Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, Zhiqiang He

Low-Rank Key Value Attention

The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads, while being compute efficient. Each…

Machine Learning · Computer Science 2026-04-09 James O'Neill, Robert Clancy, Mariia Matskevichus, Fergal Reid

GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which…

Machine Learning · Computer Science 2025-05-27 Farnoosh Javadi, Walid Ahmed, Habib Hajimolahoseini, Foozhan Ataiefard, Mohammad Hassanpour, Saina Asani, Austin Wen, Omar Mohamed Awad, Kangling Liu, Yang Liu

Quantum Vision Transformers

In this work, quantum transformers are designed and analysed in detail by extending the state-of-the-art classical transformer neural network architectures known to be very performant in natural language processing and image analysis.…

Quantum Physics · Physics 2024-02-28 El Amine Cherrat, Iordanis Kerenidis, Natansh Mathur, Jonas Landman, Martin Strahm, Yun Yvonna Li

Vision Transformer with Quadrangle Attention

Window-based attention has become a popular choice in vision transformers due to its superior performance, lower computational complexity, and less memory footprint. However, the design of hand-crafted windows, which is data-agnostic,…

Computer Vision and Pattern Recognition · Computer Science 2023-03-28 Qiming Zhang, Jing Zhang, Yufei Xu, Dacheng Tao

Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which…

Machine Learning · Computer Science 2024-04-09 Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, Purushotham Kamath

Fair Comparison between Efficient Attentions

Transformers have been successfully used in various fields and are becoming the standard tools in computer vision. However, self-attention, a core component of transformers, has a quadratic complexity problem, which limits the use of…

Computer Vision and Pattern Recognition · Computer Science 2022-06-02 Jiuk Hong, Chaehyeon Lee, Soyoun Bang, Heechul Jung

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but can lead to major memory bottlenecks as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing,…

Machine Learning · Computer Science 2024-10-16 Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji

QV May Be Enough: Toward the Essence of Attention in LLMs

Starting from first principles and a linguistic perspective centered on part-of-speech (POS) and syntactic analysis, this paper explores and derives the underlying essence of the Query-Key-Value (QKV) mechanism within the Transformer…

Artificial Intelligence · Computer Science 2026-03-18 Zhang Edward

Rethinking Query, Key, and Value Embedding in Vision Transformer under Tiny Model Constraints

A vision transformer (ViT) is the dominant model in the computer vision field. Despite numerous studies that mainly focus on dealing with inductive bias and complexity, there remains the problem of finding better transformer networks. For…

Computer Vision and Pattern Recognition · Computer Science 2023-05-01 Jaesin Ahn, Jiuk Hong, Jeongwoo Ju, Heechul Jung

Keyword Transformer: A Self-Attention Model for Keyword Spotting

The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or…

Audio and Speech Processing · Electrical Eng. & Systems 2022-04-11 Axel Berg, Mark O'Connor, Miguel Tairum Cruz

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention

Key-value (KV) caching plays an essential role in accelerating decoding for transformer-based autoregressive large language models (LLMs). However, the amount of memory required to store the KV cache can become prohibitive at long sequence…

Machine Learning · Computer Science 2024-05-22 William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan Kelly

Machine Learning for Brain Disorders: Transformers and Visual Transformers

Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision. They measure the relationships between pairs of input tokens (words in…

Computer Vision and Pattern Recognition · Computer Science 2023-03-22 Robin Courant, Maika Edberg, Nicolas Dufour, Vicky Kalogeiton

KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major…

Machine Learning · Computer Science 2025-12-08 Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau

Reducing the Transformer Architecture to a Minimum

Transformers are a widespread and successful model architecture, particularly in Natural Language Processing (NLP) and Computer Vision (CV). The essential innovation of this architecture is the Attention Mechanism, which solves the problem…

Machine Learning · Computer Science 2024-11-25 Bernhard Bermeitinger, Tomas Hrycej, Massimo Pavone, Julianus Kath, Siegfried Handschuh

KVT: k-NN Attention for Boosting Vision Transformers

Convolutional Neural Networks (CNNs) have dominated computer vision for years, due to their ability to capture locality and translation invariance. Recently, many vision transformer architectures have been proposed and they show promising…

Computer Vision and Pattern Recognition · Computer Science 2022-07-26 Pichao Wang, Xue Wang, Fan Wang, Ming Lin, Shuning Chang, Hao Li, Rong Jin

Are queries and keys always relevant? A case study on Transformer wave functions

The dot product attention mechanism, originally designed for natural language processing tasks, is a cornerstone of modern Transformers. It adeptly captures semantic relationships between word pairs in sentences by computing a similarity…

Disordered Systems and Neural Networks · Physics 2025-01-14 Riccardo Rende, Luciano Loris Viteritti

Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Standard Transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations…

Machine Learning · Computer Science 2026-03-31 Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang

Transformer Reconstructed with Dynamic Value Attention

Since the Transformer was first published in 2017, several works have been proposed to optimize it. However, the major structure of the Transformer remains unchanged, ignoring one of its main intrinsic limitations, which is the same static value…

Machine Learning · Computer Science 2025-12-30 Xiaowei Wang