Related papers: Minerva: A Programmable Memory Test Benchmark for …

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents

Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent's ability to passively retrieve…

Computation and Language · Computer Science 2026-01-29 Yiting Shen, Kun Li, Wei Zhou, Songlin Hu

MemBench: Towards More Comprehensive Evaluation on the Memory of LLM-based Agents

Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities still remains…

Computation and Language · Computer Science 2025-06-30 Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong

MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems

Scaling up data, parameters, and test-time computation has been the mainstream method to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained…

Machine Learning · Computer Science 2026-05-12 Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu

StoryBench: A Dynamic Benchmark for Evaluating Long-Term Memory with Multi Turns

Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a…

Computation and Language · Computer Science 2025-06-17 Luanbo Wan, Weizhi Ma

Evaluating Memory Structure in LLM Agents

Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly…

Machine Learning · Computer Science 2026-05-25 Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin

A Benchmark for Procedural Memory Retrieval in Language Agents

Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies -- a core limitation of procedural memory systems. We present the first benchmark that isolates procedural memory retrieval…

Computation and Language · Computer Science 2025-12-01 Ishant Kohar, Aswanth Krishnan

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component -- memory, encompassing how agents memorize, update, and retrieve long-term…

Computation and Language · Computer Science 2026-03-19 Yuanzhe Hu, Yu Wang, Julian McAuley

EvoMemBench: Benchmarking Agent Memory from a Self-Evolving Perspective

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability…

Computation and Language · Computer Science 2026-05-19 Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

Benchmarks as Microscopes: A Call for Model Metrology

Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their…

Software Engineering · Computer Science 2024-07-31 Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra

LemmaBench: A Live, Research-Level Benchmark to Evaluate LLM Capabilities in Mathematics

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

Procedural Memory Is Not All You Need: Bridging Cognitive Gaps in LLM-Based Agents

Large Language Models (LLMs) represent a landmark achievement in Artificial Intelligence (AI), demonstrating unprecedented proficiency in procedural tasks such as text generation, code completion, and conversational coherence. These…

Artificial Intelligence · Computer Science 2025-05-07 Schaun Wheeler, Olivier Jeunen

Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative…

Computation and Language · Computer Science 2026-02-24 Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell

Detecting Memorization in Large Language Models

Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit…

Machine Learning · Computer Science 2024-12-03 Eduardo Slonski

MemoryRewardBench: Benchmarking Reward Models for Long-Term Memory Management in Large Language Models

Existing works increasingly adopt memory-centric mechanisms to process long contexts in a segment manner, and effective memory management is one of the key capabilities that enables large language models to effectively propagate information…

Computation and Language · Computer Science 2026-01-27 Zecheng Tang, Baibei Ji, Ruoxi Sun, Haitian Wang, WangJie You, Zhang Yijun, Wenpeng Zhu, Ji Qi, Juntao Li, Min Zhang

RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction

As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or…

Computation and Language · Computer Science 2026-01-13 Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen

Establishing Vocabulary Tests as a Benchmark for Evaluating Large Language Models

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific…

Computation and Language · Computer Science 2024-01-30 Gonzalo Martínez, Javier Conde, Elena Merino-Gómez, Beatriz Bermúdez-Margaretto, José Alberto Hernández, Pedro Reviriego, Marc Brysbaert

Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework

Commonly, AI or machine learning (ML) models are evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications -- a…

Machine Learning · Computer Science 2024-06-18 Olivier Binette, Jerome P. Reiter

BenchmarkCards: Standardized Documentation for Large Language Model Benchmarks

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different…

Computation and Language · Computer Science 2025-06-04 Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla

Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models

Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive…

Computation and Language · Computer Science 2025-02-21 James Fodor

Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework

Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative…

Computation and Language · Computer Science 2026-05-04 Yanchen Wu, Tenghui Lin, Yingli Zhou, Fangyuan Zhang, Qintian Guo, Xun Zhou, Sibo Wang, Xilin Liu, Yuchi Ma, Yixiang Fang