Related papers: Minerva: A Programmable Memory Test Benchmark for …

200 papers

Large Language Model (LLM)-based agents are increasingly deployed for complex, tool-based tasks where long-term memory is critical to driving actions. Existing benchmarks, however, primarily test an agent's ability to passively retrieve…

Computation and Language · Computer Science 2026-01-29 Yiting Shen, Kun Li, Wei Zhou, Songlin Hu

Recent works have highlighted the significance of memory mechanisms in LLM-based agents, which enable them to store observed information and adapt to dynamic environments. However, evaluating their memory capabilities still remains…

Computation and Language · Computer Science 2025-06-30 Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, Zhenhua Dong

Scaling up data, parameters, and test-time computation have been the mainstream methods to improve LLM systems (LLMsys), but their upper bounds are almost reached due to the gradual depletion of high-quality data and marginal gains obtained…

Machine Learning · Computer Science 2026-05-12 Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, Yiqun Liu

Long-term memory (LTM) is essential for large language models (LLMs) to achieve autonomous intelligence in complex, evolving environments. Despite increasing efforts in memory-augmented and retrieval-based architectures, there remains a…

Computation and Language · Computer Science 2025-06-17 Luanbo Wan, Weizhi Ma

Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly…

Machine Learning · Computer Science 2026-05-25 Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin

Current AI agents excel in familiar settings, but fail sharply when faced with novel tasks with unseen vocabularies -- a core limitation of procedural memory systems. We present the first benchmark that isolates procedural memory retrieval…

Computation and Language · Computer Science 2025-12-01 Ishant Kohar, Aswanth Krishnan

Recent benchmarks for Large Language Model (LLM) agents primarily focus on evaluating reasoning, planning, and execution capabilities, while another critical component, memory, encompassing how agents memorize, update, and retrieve long-term…

Computation and Language · Computer Science 2026-03-19 Yuanzhe Hu, Yu Wang, Julian McAuley

Recent benchmarks for Large Language Model (LLM) agents mainly evaluate reasoning, planning, and execution. However, memory is also essential for agents, as it enables them to store, update, and retrieve information over time. This ability…

Computation and Language · Computer Science 2026-05-19 Yuyao Wang, Zhongjian Zhang, Mo Chi, Kaichi Yu, Yuhan Li, Miao Peng, Bing Tong, Chen Zhang, Yan Zhou, Jia Li

Modern language models (LMs) pose a new challenge in capability assessment. Static benchmarks inevitably saturate without providing confidence in the deployment tolerances of LM-based systems, but developers nonetheless claim that their…

Software Engineering · Computer Science 2024-07-31 Michael Saxon, Ari Holtzman, Peter West, William Yang Wang, Naomi Saphra

We present a new approach for benchmarking Large Language Model (LLM) capabilities on research-level mathematics. Existing benchmarks largely rely on static, hand-curated sets of contest or textbook-style problems as proxies for…

Artificial Intelligence · Computer Science 2026-03-02 Antoine Peyronnet, Fabian Gloeckle, Amaury Hayat

Large Language Models (LLMs) represent a landmark achievement in Artificial Intelligence (AI), demonstrating unprecedented proficiency in procedural tasks such as text generation, code completion, and conversational coherence. These…

Artificial Intelligence · Computer Science 2025-05-07 Schaun Wheeler, Olivier Jeunen

Evaluating the abilities of large language models (LLMs) for tasks that require long-term memory and thus long-context reasoning, for example in conversational settings, is hampered by the existing benchmarks, which often lack narrative…

Computation and Language · Computer Science 2026-02-24 Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, J Ross Mitchell

Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit…

Machine Learning · Computer Science 2024-12-03 Eduardo Slonski

Existing works increasingly adopt memory-centric mechanisms to process long contexts in a segment manner, and effective memory management is one of the key capabilities that enables large language models to effectively propagate information…

Computation and Language · Computer Science 2026-01-27 Zecheng Tang, Baibei Ji, Ruoxi Sun, Haitian Wang, WangJie You, Zhang Yijun, Wenpeng Zhu, Ji Qi, Juntao Li, Min Zhang

As Large Language Models (LLMs) evolve from static dialogue interfaces to autonomous general agents, effective memory is paramount to ensuring long-term consistency. However, existing benchmarks primarily focus on casual conversation or…

Computation and Language · Computer Science 2026-01-13 Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, Ronghao Chen

Vocabulary tests, once a cornerstone of language modeling evaluation, have been largely overlooked in the current landscape of Large Language Models (LLMs) like Llama, Mistral, and GPT. While most LLM evaluation benchmarks focus on specific…

Commonly, AI or machine learning (ML) models are evaluated on benchmark datasets. This practice supports innovative methodological research, but benchmark performance can be poorly correlated with performance in real-world applications -- a…

Machine Learning · Computer Science 2024-06-18 Olivier Binette, Jerome P. Reiter

Large language models (LLMs) are powerful tools capable of handling diverse tasks. Comparing and selecting appropriate LLMs for specific tasks requires systematic evaluation methods, as models exhibit varying capabilities across different…

Computation and Language · Computer Science 2025-06-04 Anna Sokol, Elizabeth Daly, Michael Hind, David Piorkowski, Xiangliang Zhang, Nuno Moniz, Nitesh Chawla

Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive…

Computation and Language · Computer Science 2025-02-21 James Fodor

Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative…

Computation and Language · Computer Science 2026-05-04 Yanchen Wu, Tenghui Lin, Yingli Zhou, Fangyuan Zhang, Qintian Guo, Xun Zhou, Sibo Wang, Xilin Liu, Yuchi Ma, Yixiang Fang