Related papers: Neurocache: Efficient Vector Retrieval for Long-ra…

MiniCache: KV Cache Compression in Depth Dimension for Large Language Models

A critical approach for efficiently deploying computationally demanding large language models (LLMs) is Key-Value (KV) caching. The KV cache stores key-value states of previously generated tokens, significantly reducing the need for…

Computation and Language · Computer Science 2024-09-10 Akide Liu , Jing Liu , Zizheng Pan , Yefei He , Gholamreza Haffari , Bohan Zhuang

Improving Neural Language Models with a Continuous Cache

We propose an extension to neural network language models to adapt their prediction to the recent history. Our model is a simplified version of memory augmented networks, which stores past hidden activations as memory and accesses them…

Computation and Language · Computer Science 2016-12-15 Edouard Grave , Armand Joulin , Nicolas Usunier

MemLong: Memory-Augmented Retrieval for Long Text Modeling

Recent advancements in Large Language Models (LLMs) have yielded remarkable success across diverse fields. However, handling long contexts remains a significant challenge for LLMs due to the quadratic time and space complexity of attention…

Computation and Language · Computer Science 2024-09-02 Weijie Liu , Zecheng Tang , Juntao Li , Kehai Chen , Min Zhang

Efficient LLM Inference with Kcache

Large Language Models (LLMs) have had a profound impact on AI applications, particularly in the domains of long-text comprehension and generation. KV Cache technology is one of the most widely used techniques in the industry. It ensures…

Computation and Language · Computer Science 2024-04-30 Qiaozhi He , Zhihua Wu

Augmenting Language Models with Long-Term Memory

Existing large language models (LLMs) can only afford fixed-size inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models…

Computation and Language · Computer Science 2023-06-13 Weizhi Wang , Li Dong , Hao Cheng , Xiaodong Liu , Xifeng Yan , Jianfeng Gao , Furu Wei

Unbounded cache model for online language modeling with open vocabulary

Recently, continuous cache models were proposed as extensions to recurrent neural network language models, to adapt their predictions to local changes in the data distribution. These models only capture the local context, of up to a few…

Machine Learning · Computer Science 2017-11-08 Edouard Grave , Moustapha Cisse , Armand Joulin

Neural Language Modeling With Implicit Cache Pointers

A cache-inspired approach is proposed for neural language models (LMs) to improve long-range dependency and better predict rare words from long contexts. This approach is a simpler alternative to attention-based pointer mechanism that…

Audio and Speech Processing · Electrical Eng. & Systems 2020-09-30 Ke Li , Daniel Povey , Sanjeev Khudanpur

Memorizing Transformers

Language models typically need to be trained or finetuned in order to acquire new knowledge, which involves updating their weights. We instead envision language models that can simply read and memorize new data at inference time, thus…

Machine Learning · Computer Science 2022-03-18 Yuhuai Wu , Markus N. Rabe , DeLesley Hutchins , Christian Szegedy

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but the limited GPU memory (VRAM) resources struggle to accommodate the linearly growing demand for key-value (KV) cache as the…

Computation and Language · Computer Science 2025-03-21 Shibo Jie , Yehui Tang , Kai Han , Zhi-Hong Deng , Jing Han

LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As…

Machine Learning · Computer Science 2025-07-22 Dachuan Shi , Yonggan Fu , Xiangchi Yuan , Zhongzhi Yu , Haoran You , Sixu Li , Xin Dong , Jan Kautz , Pavlo Molchanov , Yingyan Lin

Needle in the Haystack for Memory Based Large Language Models

Current large language models (LLMs) often perform poorly on simple fact retrieval tasks. Here we investigate if coupling a dynamically adaptable external memory to an LLM can alleviate this problem. For this purpose, we test Larimar, a…

Computation and Language · Computer Science 2024-07-15 Elliot Nelson , Georgios Kollias , Payel Das , Subhajit Chaudhury , Soham Dan

ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models

Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of…

Computation and Language · Computer Science 2025-07-16 Jianxin Yan , Wangze Ni , Lei Chen , Xuemin Lin , Peng Cheng , Zhan Qin , Kui Ren

A Survey on Large Language Model Acceleration based on KV Cache Management

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the…

Artificial Intelligence · Computer Science 2025-07-31 Haoyang Li , Yiming Li , Anxin Tian , Tianhao Tang , Zhanchao Xu , Xuejia Chen , Nicole Hu , Wei Dong , Qing Li , Lei Chen

FlashBack: Efficient Retrieval-Augmented Language Modeling for Long Context Inference

Retrieval-Augmented Language Modeling (RALM) by integrating large language models (LLM) with relevant documents from an external corpus is a proven method for enabling the LLM to generate information beyond the scope of its pre-training…

Computation and Language · Computer Science 2025-06-16 Runheng Liu , Xingchen Xiao , Heyan Huang , Zewen Chi , Zhijing Wu

Great Memory, Shallow Reasoning: Limits of kNN-LMs

K-nearest neighbor language models (kNN-LMs), which integrate retrieval with next-word prediction, have demonstrated strong performance in language modeling as well as downstream NLP benchmarks. These results have led researchers to…

Computation and Language · Computer Science 2024-08-22 Shangyi Geng , Wenting Zhao , Alexander M Rush

CacheFocus: Dynamic Cache Re-Positioning for Efficient Retrieval-Augmented Generation

Large Language Models (LLMs) excel across a variety of language tasks yet are constrained by limited input lengths and high computational costs. Existing approaches, such as relative positional encodings (e.g., RoPE, ALiBi) and…

Computation and Language · Computer Science 2025-02-18 Kun-Hui Lee , Eunhwan Park , Donghoon Han , Seung-Hoon Na

Why do Nearest Neighbor Language Models Work?

Language models (LMs) compute the probability of a text by sequentially computing a representation of an already-seen context and using this representation to predict the next word. Currently, most LMs calculate these representations…

Computation and Language · Computer Science 2023-01-18 Frank F. Xu , Uri Alon , Graham Neubig

Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models

In this paper, we rethink sparse lexical representations for image retrieval. By utilizing multi-modal large language models (M-LLMs) that support visual prompting, we can extract image features and convert them into textual data, enabling…

Computer Vision and Pattern Recognition · Computer Science 2024-08-30 Kengo Nakata , Daisuke Miyashita , Youyang Ng , Yasuto Hoshi , Jun Deguchi

Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when…

Computation and Language · Computer Science 2018-05-15 Urvashi Khandelwal , He He , Peng Qi , Dan Jurafsky

LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference

Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time…

Computation and Language · Computer Science 2024-06-27 Zhongwei Wan , Ziang Wu , Che Liu , Jinfa Huang , Zhihong Zhu , Peng Jin , Longyue Wang , Li Yuan