Related papers: SwingArena: Competitive Programming Arena for Long…

CodeArena: A Collective Evaluation Platform for LLM Code Generation

Large Language Models (LLMs) have reshaped code generation by synergizing their exceptional comprehension of natural language and programming syntax, thereby substantially boosting developer productivity. These advancements have prompted…

Software Engineering · Computer Science 2025-03-04 Mingzhe Du, Anh Tuan Luu, Bin Ji, Xiaobao Wu, Dong Huang, Terry Yue Zhuo, Qian Liu, See-Kiong Ng

SWE-bench-java: A GitHub Issue Resolving Benchmark for Java

GitHub issue resolving is a critical task in software engineering, recently gaining significant attention in both industry and academia. Within this task, SWE-bench has been released to evaluate issue resolving capabilities of large…

Software Engineering · Computer Science 2024-08-27 Daoguang Zan, Zhirong Huang, Ailun Yu, Shaoxin Lin, Yifan Shi, Wei Liu, Dong Chen, Zongshuai Qi, Hao Yu, Lei Yu, Dezhi Ran, Muhan Zeng, Bo Shen, Pan Bian, Guangtai Liang, Bei Guan, Pengjie Huang, Tao Xie, Yongji Wang, Qianxiang Wang

RankArena: A Unified Platform for Evaluating Retrieval, Reranking and RAG with Human and LLM Feedback

Evaluating the quality of retrieval-augmented generation (RAG) and document reranking systems remains challenging due to the lack of scalable, user-centric, and multi-perspective evaluation tools. We introduce RankArena, a unified platform…

Information Retrieval · Computer Science 2025-08-08 Abdelrahman Abdallah, Mahmoud Abdalla, Bhawna Piryani, Jamshid Mozafari, Mohammed Ali, Adam Jatowt

SWE-Arena: An Interactive Platform for Evaluating Foundation Models in Software Engineering

Foundation models (FMs), particularly large language models (LLMs), have shown significant promise in various software engineering (SE) tasks, including code generation, debugging, and requirement refinement. Despite these advances,…

Software Engineering · Computer Science 2025-10-13 Zhimin Zhao

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Crowdsourced model evaluation platforms, such as Chatbot Arena, enable real-time evaluation from human perspectives to assess the quality of model responses. In the coding domain, manually examining the quality of LLM-generated content is…

Software Engineering · Computer Science 2025-12-19 Terry Yue Zhuo, Xiaolong Jin, Hange Liu, Juyong Jiang, Tianyang Liu, Chen Gong, Bhupesh Bishnoi, Vaisakhi Mishra, Marek Suppa, Noah Ziems, Saiteja Utpala, Ming Xu, Guangyu Song, Kaixin Li, Yuhan Cao, Bo Liu, Zheng Liu, Sabina Abdurakhmanova, Wenhao Yu, Mengzhao Jia, Jihan Yao, Kenneth Hamilton, Kumar Shridhar, Minh Chien Vu, Dingmin Wang, Jiawei Liu, Zijian Wang, Qian Liu, Binyuan Hui, Meg Risdal, Ahsen Khaliq, Atin Sood, Zhenchang Xing, Wasi Uddin Ahmad, John Grundy, David Lo, Banghua Zhu, Xiaoning Du, Torsten Scholak, Leandro von Werra

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination…

Software Engineering · Computer Science 2025-07-18 Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh

ResearchArena: Benchmarking Large Language Models' Ability to Collect and Organize Information as Research Agents

Large language models (LLMs) excel across many natural language processing tasks but face challenges in domain-specific, analytical tasks such as conducting research surveys. This study introduces ResearchArena, a benchmark designed to…

Artificial Intelligence · Computer Science 2025-09-09 Hao Kang, Chenyan Xiong

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Language models have outpaced our ability to evaluate them effectively, but for their future development it is essential to study the frontier of their capabilities. We find real-world software engineering to be a rich, sustainable, and…

Computation and Language · Computer Science 2024-11-13 Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, Karthik Narasimhan

GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning

The evaluation of large language models (LLMs) has traditionally relied on static benchmarks, a paradigm that poses two major limitations: (1) predefined test sets lack adaptability to diverse application domains, and (2) standardized…

Computation and Language · Computer Science 2025-05-29 Qingchen Yu, Zifan Zheng, Ding Chen, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

SWE-Bench++: A Framework for the Scalable Generation of Software Engineering Benchmarks from Open-Source Repositories

Benchmarks like SWE-bench have standardized the evaluation of Large Language Models (LLMs) on repository-level software engineering tasks. However, these efforts remain limited by manual curation, static datasets, and a focus on…

Software Engineering · Computer Science 2025-12-22 Lilin Wang, Lucas Ramalho, Alan Celestino, Phuc Anthony Pham, Yu Liu, Umang Kumar Sinha, Andres Portillo, Onassis Osunwa, Gabriel Maduekwe

LongReasonArena: A Long Reasoning Benchmark for Large Language Models

Existing long-context benchmarks for Large Language Models (LLMs) focus on evaluating comprehension of long inputs, while overlooking the evaluation of long reasoning abilities. To address this gap, we introduce LongReasonArena, a benchmark…

Computation and Language · Computer Science 2025-08-28 Jiayu Ding, Shuming Ma, Lei Cui, Nanning Zheng, Furu Wei

TestBench: Evaluating Class-Level Test Case Generation Capability of Large Language Models

Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based…

Software Engineering · Computer Science 2024-09-27 Quanjun Zhang, Ye Shang, Chunrong Fang, Siqi Gu, Jianyi Zhou, Zhenyu Chen

Long Code Arena: a Set of Benchmarks for Long-Context Code Models

Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few…

Machine Learning · Computer Science 2024-06-18 Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, Timofey Bryksin

Inclusion Arena: An Open Platform for Evaluating Large Foundation Models with Real-World Apps

Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have ushered in a new era of AI capabilities, demonstrating near-human-level performance across diverse scenarios. While numerous benchmarks (e.g., MMLU) and…

Artificial Intelligence · Computer Science 2025-09-03 Kangyu Wang, Hongliang He, Lin Liu, Ruiqi Liang, Zhenzhong Lan, Jianguo Li

Integrating Large Language Models in Software Engineering Education: A Pilot Study through GitHub Repositories Mining

Context: Large Language Models (LLMs) such as ChatGPT are increasingly adopted in software engineering (SE) education, offering both opportunities and challenges. Their adoption requires systematic investigation to ensure responsible…

Software Engineering · Computer Science 2025-09-08 Maryam Khan, Muhammad Azeem Akbar, Jussi Kasurinen

Evaluating and Aligning CodeLLMs on Human Preference

Code large language models (codeLLMs) have made significant strides in code generation. Most previous code-related benchmarks, which consist of various programming exercises along with the corresponding test cases, are used as a common…

Computation and Language · Computer Science 2024-12-09 Jian Yang, Jiaxi Yang, Ke Jin, Yibo Miao, Lei Zhang, Liqun Yang, Zeyu Cui, Yichang Zhang, Binyuan Hui, Junyang Lin

SC2Arena and StarEvolve: Benchmark and Self-Improvement Framework for LLMs in Complex Decision-Making Tasks

Evaluating large language models (LLMs) in complex decision-making is essential for advancing AI's ability for strategic planning and real-time adaptation. However, existing benchmarks for tasks like StarCraft II fail to capture the game's…

Machine Learning · Computer Science 2025-08-15 Pengbo Shen, Yaqing Wang, Ni Mu, Yao Luan, Runpeng Xie, Senhao Yang, Lexiang Wang, Hao Hu, Shuang Xu, Yiqin Yang, Bo Xu

TextArena

TextArena is an open-source collection of competitive text-based games for training and evaluation of agentic behavior in Large Language Models (LLMs). It spans 57+ unique environments (including single-player, two-player, and multi-player…

Computation and Language · Computer Science 2025-05-27 Leon Guertler, Bobby Cheng, Simon Yu, Bo Liu, Leshem Choshen, Cheston Tan

Are Large Language Models a Threat to Programming Platforms? An Exploratory Study

Competitive programming platforms like LeetCode, Codeforces, and HackerRank evaluate programming skills, often used by recruiters for screening. With the rise of advanced Large Language Models (LLMs) such as ChatGPT, Gemini, and Meta AI,…

Software Engineering · Computer Science 2024-09-10 Md Mustakim Billah, Palash Ranjan Roy, Zadia Codabux, Banani Roy

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can…

Computation and Language · Computer Science 2026-05-19 Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, Kun Zhan, Sen Su, Chunxiao Liu, Ning Miao