Related papers: Agentic Repository Mining: A Multi-Task Evaluation

ContextBench: A Benchmark for Context Retrieval in Coding Agents

LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during…

Machine Learning · Computer Science 2026-02-12 Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, He Ye

Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

A widespread practice in software development is to tailor coding agents to repositories using context files, such as AGENTS.md, by either manually or automatically generating them. Although this practice is strongly encouraged by agent…

Software Engineering · Computer Science 2026-02-13 Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev

WideSearch: Benchmarking Agentic Broad Info-Seeking

From professional research to everyday planning, many tasks are bottlenecked by wide-scale information seeking, which is more repetitive than cognitively complex. With the rapid development of Large Language Models (LLMs), automated search…

Computation and Language · Computer Science 2025-08-29 Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, Ke Wang

RefactorBench: Evaluating Stateful Reasoning in Language Agents Through Code

Recent advances in language model (LM) agents and function calling have enabled autonomous, feedback-driven systems to solve problems across various digital domains. To better understand the unique limitations of LM agents, we introduce…

Artificial Intelligence · Computer Science 2025-03-12 Dhruv Gautam, Spandan Garg, Jinu Jang, Neel Sundaresan, Roshanak Zilouchian Moghaddam

Context Training with Active Information Seeking

Most existing large language models (LLMs) are expensive to adapt after deployment, especially when a task requires newly produced information or niche domain knowledge. Recent work has shown that, by manipulating and optimizing their…

Computation and Language · Computer Science 2026-05-15 Zeyu Huang, Adhiguna Kuncoro, Qixuan Feng, Jiajun Shen, Lucio Dery, Arthur Szlam, Marc'Aurelio Ranzato

AgentBench: Evaluating LLMs as Agents

The potential of Large Language Model (LLM) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present…

Artificial Intelligence · Computer Science 2025-10-07 Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, Jie Tang

On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks

Large language models (LLMs) have been shown to be valuable tools for tackling process mining tasks. Existing studies report on their capability to support various data-driven process analyses and even, to some extent, that they are able to…

Databases · Computer Science 2025-05-01 Adrian Rebmann, Fabian David Schmidt, Goran Glavaš, Han van der Aa

Less is More: Benchmarking LLM Based Recommendation Agents

Large Language Models (LLMs) are increasingly deployed for personalized product recommendations, with practitioners commonly assuming that longer user purchase histories lead to better predictions. We challenge this assumption through a…

Information Retrieval · Computer Science 2026-01-29 Kargi Chauhan, Mahalakshmi Venkateswarlu

Scaling Agentic Capabilities, Not Context: Efficient Reinforcement Finetuning for Large Toolspaces

Agentic systems operating over large tool ecosystems must plan and execute long-horizon workflows under weak or non-verifiable supervision. While frontier models mitigate these challenges through scale and large context budgets, small…

Machine Learning · Computer Science 2026-03-10 Karan Gupta, Pranav Vajreshwari, Yash Pandya, Raghav Magazine, Akshay Nambi, Ahmed Awadallah

Reliable Annotations with Less Effort: Evaluating LLM-Human Collaboration in Search Clarifications

Despite growing interest in using large language models (LLMs) to automate annotation, their effectiveness in complex, nuanced, and multi-dimensional labelling tasks remains relatively underexplored. This study focuses on annotation for the…

Information Retrieval · Computer Science 2025-07-02 Leila Tavakoli, Hamed Zamani

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long…

Computation and Language · Computer Science 2024-11-07 Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev

A Reference Architecture for Agentic Hybrid Retrieval in Dataset Search

Ad hoc dataset search requires matching underspecified natural-language queries against sparse, heterogeneous metadata records, a task where typical lexical or dense retrieval alone falls short. We reposition dataset search as a…

Information Retrieval · Computer Science 2026-04-21 Riccardo Terrenzi, Phongsakon Mark Konrad, Tim Lukas Adam, Serkan Ayvaz

Code2UML: Agentic LLMs with context engineering for scalable software visualization

Large Language Model (LLM)-based code analysis tools are adopted to automate software documentation tasks. However, the scalability of these approaches to real codebases, where Intermediate Representations (IR) exceed LLM context limits,…

Software Engineering · Computer Science 2026-05-26 Alin-Gabriel Văduva, Anca-Ioana Andreescu, Simona-Vasilica Oprea, Adela Bâra

Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate…

Computation and Language · Computer Science 2025-04-24 Jonathan Roberts, Kai Han, Samuel Albanie

Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models

Despite their widespread adoption, large language models (LLMs) remain prohibitive to use under resource constraints, with their ever growing sizes only increasing the barrier for use. One noted issue is the high latency associated with…

Machine Learning · Computer Science 2024-12-17 Jerry Huang, Prasanna Parthasarathi, Mehdi Rezagholizadeh, Sarath Chandar

From Online User Feedback to Requirements: Evaluating Large Language Models for Classification and Specification Tasks

[Context and Motivation] Online user feedback provides valuable information to support requirements engineering (RE). However, analyzing online user feedback is challenging due to its large volume and noise. Large language models (LLMs)…

Software Engineering · Computer Science 2025-10-28 Manjeshwar Aniruddh Mallya, Alessio Ferrari, Mohammad Amin Zadenoori, Jacek Dąbrowski

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is…

Software Engineering · Computer Science 2025-10-24 Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

FML-bench: Benchmarking Machine Learning Agents for Scientific Research

Large language models (LLMs) have sparked growing interest in machine learning research agents that can autonomously propose ideas and conduct experiments. However, existing benchmarks predominantly adopt an engineering-oriented…

Computation and Language · Computer Science 2026-02-26 Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu

ToolScope: Enhancing LLM Agent Tool Use through Tool Merging and Context-Aware Filtering

Large language model (LLM) agents rely on external tools to solve complex tasks, but real-world toolsets often contain redundant tools with overlapping names and descriptions, introducing ambiguity and reducing selection accuracy. LLMs also…

Computation and Language · Computer Science 2026-05-12 Marianne Menglin Liu, Daniel Garcia, Fjona Parllaku, Vikas Upadhyay, Syed Fahad Allam Shah, Dan Roth

LocalSearchBench: Benchmarking Agentic Search in Real-World Local Life Services

Recent advances in large reasoning models (LRMs) have enabled agentic search systems to perform complex multi-step reasoning across multiple sources. However, most studies focus on general information retrieval and rarely explore vertical…

Artificial Intelligence · Computer Science 2026-01-14 Hang He, Chuhuai Yue, Chengqi Dong, Mingxue Tian, Hao Chen, Zhenfeng Liu, Jiajun Chai, Xiaohan Wang, Yufei Zhang, Qun Liao, Guojun Yin, Wei Lin, Chengcheng Wan, Haiying Sun, Ting Su