Related papers: CIBench: Evaluating Your LLMs with a Code Interpre…

200 papers

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available,…

The potential of Large Language Models (LLMs) as agents has been widely acknowledged recently. Thus, there is an urgent need to quantitatively evaluate LLMs as agents on challenging tasks in interactive environments. We present…

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks range from software engineering development to general-purpose reasoning. While current benchmarks have…

The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as…

Software Engineering · Computer Science 2024-08-06 Yaolun Zhang, Yinxu Pan, Yudong Wang, Jie Cai

Large Language Models (LLMs) show promise as data analysis agents, but existing benchmarks overlook the iterative nature of the field, where experts' decisions evolve with deeper insights of the dataset. To address this, we introduce…

Computation and Language · Computer Science 2025-06-09 Hanyu Li, Haoyu Liu, Tingyu Zhu, Tianyu Guo, Zeyu Zheng, Xiaotie Deng, Michael I. Jordan

Agents based on Large Language Models (LLMs) have shown promise for performing sophisticated software engineering tasks autonomously. In addition, there has been progress towards developing agents that can perform parts of the research…

Computation and Language · Computer Science 2026-04-23 Nicholas Edwards, Yukyung Lee, Yujun Audrey Mao, Yulu Qin, Sebastian Schuster, Najoung Kim

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the…

Software Engineering · Computer Science 2024-11-15 Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, Hongxia Yang

Large Language Models (LLMs) have made progress in various real-world tasks, which stimulates requirements for the evaluation of LLMs. Existing LLM evaluation methods are mainly based on supervised signals, which depend on static datasets and…

Computation and Language · Computer Science 2023-09-11 Jiatong Li, Rui Li, Qi Liu

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing…

Large Language Models (LLMs) have the potential to enhance Agent-Based Modeling by better representing complex interdependent cybersecurity systems, improving cybersecurity threat modeling and risk management. However, evaluating LLMs in…

Cryptography and Security · Computer Science 2024-06-12 Tam N. Nguyen

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture their inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To…

Computation and Language · Computer Science 2025-10-20 Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao

We introduce ISO-Bench, a benchmark for coding agents to test their capabilities on real-world inference optimization tasks. These tasks were taken from vLLM and SGLang, two of the most popular LLM serving frameworks. Each task provides an…

Machine Learning · Computer Science 2026-02-24 Ayush Nangia, Shikhar Mishra, Aman Gokrani, Paras Chopra

This paper presents DataSciBench, a comprehensive benchmark for evaluating Large Language Model (LLM) capabilities in data science. Recent related benchmarks have primarily focused on single tasks, easily obtainable ground truth, and…

Computation and Language · Computer Science 2025-02-20 Dan Zhang, Sining Zhoubian, Min Cai, Fengzu Li, Lekang Yang, Wei Wang, Tianjiao Dong, Ziniu Hu, Jie Tang, Yisong Yue

As large language models (LLMs) evolve into sophisticated autonomous agents capable of complex software development tasks, evaluating their real-world capabilities becomes critical. While existing benchmarks like…

Although LLM-based agents have attracted significant attention in domains such as software engineering and machine learning research, their role in advancing combinatorial optimization (CO) remains relatively underexplored. This gap…

Computation and Language · Computer Science 2025-08-25 Weiwei Sun, Shengyu Feng, Shanda Li, Yiming Yang

Tool use has turned large language models (LLMs) into powerful agents that can perform complex multi-step tasks by dynamically utilising external software components. However, these tools must be implemented in advance by human developers,…

Computation and Language · Computer Science 2025-06-02 Georg Wölflein, Dyke Ferber, Daniel Truhn, Ognjen Arandjelović, Jakob Nikolas Kather

Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going…

Computation and Language · Computer Science 2024-07-16 Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of creative simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation…

Computation and Language · Computer Science 2024-09-13 Qi Jia, Xiang Yue, Tianyu Zheng, Jie Huang, Bill Yuchen Lin

LLMs have gotten attention across various research domains due to their exceptional performance on a wide range of complex tasks. Therefore, refined methods to evaluate the capabilities of LLMs are needed to determine the tasks and…

Computation and Language · Computer Science 2024-06-04 Ji-Lun Peng, Sijia Cheng, Egil Diau, Yung-Yu Shih, Po-Heng Chen, Yen-Ting Lin, Yun-Nung Chen