Related papers: CoMeT: Collaborative Memory Transformer for Effici…

CAMELoT: Towards Large Language Models with Training-Free Consolidated Associative Memory

Large Language Models (LLMs) struggle to handle long input sequences due to high memory and runtime costs. Memory-augmented models have emerged as a promising solution to this problem, but current methods are hindered by limited memory…

Computation and Language · Computer Science 2024-02-22 Zexue He, Leonid Karlinsky, Donghyun Kim, Julian McAuley, Dmitry Krotov, Rogerio Feris

Contextualize Knowledge Bases with Transformer for End-to-end Task-Oriented Dialogue Systems

Incorporating knowledge bases (KB) into end-to-end task-oriented dialogue systems is challenging, since it requires to properly represent the entity of KB, which is associated with its KB context and dialogue context. The existing works…

Computation and Language · Computer Science 2021-09-30 Yanjie Gou, Yinjie Lei, Lingqiao Liu, Yong Dai, Chunxu Shen

LoPT: Lossless Parallel Tokenization Acceleration for Long Context Inference of Large Language Model

Long context inference scenarios have become increasingly important for large language models, yet they introduce significant computational latency. While prior research has optimized long-sequence inference through operators, model…

Computation and Language · Computer Science 2025-11-10 Wei Shao, Lingchao Zheng, Pengyu Wang, Peizhen Zheng, Jun Li, Yuwei Fan

HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing

Transformer-based large language models (LLM) have been widely used in language processing applications. However, due to the memory constraints of the devices, most of them restrict the context window. Even though recurrent models in…

Computation and Language · Computer Science 2025-02-07 Zifan He, Yingqi Cao, Zongyue Qin, Neha Prakriya, Yizhou Sun, Jason Cong

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Recurrent LLM architectures have emerged as a promising approach for improving reasoning, as they enable multi-step computation in the embedding space without generating intermediate tokens. Models such as Ouro perform reasoning by…

Computation and Language · Computer Science 2026-05-20 Victor Conchello Vendrell, Arnau Padres Masdemont, Niccolò Grillo, Jordi Ros-Giralt, Arash Behboodi, Fabio Valerio Massoli

Context Compression for Auto-regressive Transformers with Sentinel Tokens

The quadratic complexity of the attention module makes it gradually become the bulk of compute in Transformer-based LLMs during generation. Moreover, the excessive key-value cache that arises when dealing with long inputs also brings severe…

Computation and Language · Computer Science 2023-10-17 Siyu Ren, Qi Jia, Kenny Q. Zhu

CompLLM: Compression for Long Context Q&A

Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent…

Computation and Language · Computer Science 2025-09-24 Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah

EdgeInfinite: A Memory-Efficient Infinite-Context Transformer for Edge Devices

Transformer-based large language models (LLMs) encounter challenges in processing long sequences on edge devices due to the quadratic complexity of attention mechanisms and growing memory demands from Key-Value (KV) cache. Existing KV cache…

Computation and Language · Computer Science 2025-03-31 Jiyu Chen, Shuang Peng, Daxiong Luo, Fan Yang, Renshou Wu, Fangyuan Li, Xiaoxin Chen

Augmenting Language Models with Long-Term Memory

Existing large language models (LLMs) can only afford fix-sized inputs due to the input length limit, preventing them from utilizing rich long-context information from past inputs. To address this, we propose a framework, Language Models…

Computation and Language · Computer Science 2023-06-13 Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, Furu Wei

MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought

Large Language Models (LLMs) still suffer from severe hallucinations and catastrophic forgetting during causal reasoning over massive, fragmented long contexts. Existing memory mechanisms typically treat retrieval as a static, single-step…

Multiagent Systems · Computer Science 2026-05-19 Haodong Lei, Junming Liu, Yirong Chen, Ding Wang, Hongsong Wang

Latent Context Compilation: Distilling Long Context into Compact Portable Memory

Efficient long-context LLM deployment is stalled by a dichotomy between amortized compression, which struggles with out-of-distribution generalization, and Test-Time Training, which incurs prohibitive synthetic data costs and requires…

Machine Learning · Computer Science 2026-02-26 Zeju Li, Yizhou Zhou, Qiang Xu

InfiniPot: Infinite Context Processing on Memory-Constrained LLMs

Handling long input contexts remains a significant challenge for Large Language Models (LLMs), particularly in resource-constrained environments such as mobile devices. Our work aims to address this limitation by introducing InfiniPot, a…

Computation and Language · Computer Science 2024-10-04 Minsoo Kim, Kyuhong Shim, Jungwook Choi, Simyung Chang

Core Context Aware Transformers for Long Context Language Modeling

Transformer-based Large Language Models (LLMs) have exhibited remarkable success in extensive tasks primarily attributed to self-attention mechanism, which requires a token to consider all preceding tokens as its context to compute…

Computation and Language · Computer Science 2025-08-05 Yaofo Chen, Zeng You, Shuhai Zhang, Haokun Li, Yirui Li, Yaowei Wang, Mingkui Tan

ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time

Transformer-based Language Models' computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce…

Computation and Language · Computer Science 2025-10-23 Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmed, Yang Liu

Latent-Condensed Transformer for Efficient Long Context Modeling

Large language models (LLMs) face significant challenges in processing long contexts due to the linear growth of the key-value (KV) cache and quadratic complexity of self-attention. Existing approaches address these bottlenecks separately:…

Computation and Language · Computer Science 2026-04-17 Zeng You, Yaofo Chen, Qiuwu Chen, Ying Sun, Shuhai Zhang, Yingjian Li, Yaowei Wang, Mingkui Tan

COMET: A Neural Framework for MT Evaluation

We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgements. Our framework leverages recent breakthroughs in…

Computation and Language · Computer Science 2020-10-20 Ricardo Rei, Craig Stewart, Ana C Farinha, Alon Lavie

Long-Context Language Modeling with Parallel Context Encoding

Extending large language models (LLMs) to process longer inputs is crucial for a wide range of applications. However, the substantial computational cost of transformers and limited generalization of positional encoding restrict the size of…

Computation and Language · Computer Science 2025-06-11 Howard Yen, Tianyu Gao, Danqi Chen

Sequence Shortening for Context-Aware Machine Translation

Context-aware Machine Translation aims to improve translations of sentences by incorporating surrounding sentences as context. Towards this task, two main architectures have been applied, namely single-encoder (based on concatenation) and…

Computation and Language · Computer Science 2024-02-05 Paweł Mąka, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis

SimpleMem: Efficient Lifelong Memory for LLM Agents

To support long-term interaction in complex environments, LLM agents require memory systems that manage historical experiences. Existing approaches either retain full interaction histories via passive context extension, leading to…

Artificial Intelligence · Computer Science 2026-01-30 Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, Huaxiu Yao

Recurrent Memory Transformer

Transformer-based models show their effectiveness across multiple domains and tasks. The self-attention allows to combine information from all sequence elements into context-aware representations. However, global and local information has…

Computation and Language · Computer Science 2022-12-09 Aydar Bulatov, Yuri Kuratov, Mikhail S. Burtsev