Related papers: FullStack Bench: Evaluating LLMs as Full Stack Cod…

Multi-Programming Language Sandbox for LLMs

We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compiler and analysis tools for Large Language Models (LLMs). It can automatically identify the…

Software Engineering · Computer Science 2024-11-06 Shihan Dou, Jiazheng Zhang, Jianxiang Zang, Yunbo Tao, Weikang Zhou, Haoxiang Jia, Shichun Liu, Yuming Yang, Zhiheng Xi, Shenxi Wu, Shaoqing Zhang, Muling Wu, Changze Lv, Limao Xiong, Wenyu Zhan, Lin Zhang, Rongxiang Weng, Jingang Wang, Xunliang Cai, Yueming Wu, Ming Wen, Rui Zheng, Tao Ji, Yixin Cao, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive…

Software Engineering · Computer Science 2025-04-30 Wenjing Yin, Tianze Sun, Yijiong Yu, Jiawei Fang, Guangyao Su, Jiancheng Wang, Zekun Wang, Wei Wang, Ran Chen, Ziyun Dai, Shuai Yuan, Menghang Dong, Peng Luo, Dong Cao, Da Lei, Yajun Zhang, Hao Chen, Xiang Ma, Yong Liu, Weifeng Liu, Yuanjian Xu, Ji Pei

BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

As large language models (LLMs) continue to advance, the need for up-to-date and well-organized benchmarks becomes increasingly critical. However, many existing datasets are scattered, difficult to manage, and make it challenging to perform…

Machine Learning · Computer Science 2025-06-03 Eunsu Kim, Haneul Yoo, Guijin Son, Hitesh Patel, Amit Agarwal, Alice Oh

BenchMAX: A Comprehensive Multilingual Evaluation Suite for Large Language Models

Previous multilingual benchmarks focus primarily on simple understanding tasks, but for large language models (LLMs), we emphasize proficiency in instruction following, reasoning, long context understanding, code generation, and so on.…

Computation and Language · Computer Science 2025-04-22 Xu Huang, Wenhao Zhu, Hanxu Hu, Conghui He, Lei Li, Shujian Huang, Fei Yuan

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Large language models (LLMs) have demonstrated significant potential in advancing various fields of research and society. However, the current community of LLMs overly focuses on benchmarks for analyzing specific foundational skills (e.g.…

Computation and Language · Computer Science 2025-03-03 Xiaoshuai Song, Muxi Diao, Guanting Dong, Zhengyang Wang, Yujia Fu, Runqi Qiao, Zhexu Wang, Dayuan Fu, Huangxuan Wu, Bin Liang, Weihao Zeng, Yejie Wang, Zhuoma GongQue, Jianing Yu, Qiuna Tan, Weiran Xu

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation…

Software Engineering · Computer Science 2024-06-07 Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks range from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science 2025-04-02 Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, Simon Brunner, Chen Gong, Thong Hoang, Armel Randy Zebaze, Xiaoheng Hong, Wen-Ding Li, Jean Kaddour, Ming Xu, Zhihan Zhang, Prateek Yadav, Naman Jain, Alex Gu, Zhoujun Cheng, Jiawei Liu, Qian Liu, Zijian Wang, Binyuan Hui, Niklas Muennighoff, David Lo, Daniel Fried, Xiaoning Du, Harm de Vries, Leandro Von Werra

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

CFBench: A Comprehensive Constraints-Following Benchmark for LLMs

The adeptness of Large Language Models (LLMs) in comprehending and following natural language instructions is critical for their deployment in sophisticated real-world applications. Existing evaluations mainly focus on fragmented…

Computation and Language · Computer Science 2025-05-07 Tao Zhang, Chenglin Zhu, Yanjun Shen, Wenjing Luo, Yan Zhang, Hao Liang, Tao Zhang, Fan Yang, Mingan Lin, Yujing Qiao, Weipeng Chen, Bin Cui, Wentao Zhang, Zenan Zhou

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the…

Software Engineering · Computer Science 2024-11-15 Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, Hongxia Yang

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

INTEGRALBENCH: Benchmarking LLMs with Definite Integral Problems

We present INTEGRALBENCH, a focused benchmark designed to evaluate Large Language Model (LLM) performance on definite integral problems. INTEGRALBENCH provides both symbolic and numerical ground truth solutions with manual difficulty…

Artificial Intelligence · Computer Science 2025-07-30 Bintao Tang, Xin Yang, Yuhao Wang, Zixuan Qiu, Zimo Ji, Wenyuan Jiang

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available,…

Computation and Language · Computer Science 2024-02-27 Fahim Dalvi, Maram Hasanain, Sabri Boughorbel, Basel Mousi, Samir Abdaljalil, Nizi Nazar, Ahmed Abdelali, Shammur Absar Chowdhury, Hamdy Mubarak, Ahmed Ali, Majd Hawasly, Nadir Durrani, Firoj Alam

BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal…

Artificial Intelligence · Computer Science 2024-10-02 Xuwu Wang, Qiwen Cui, Yunzhe Tao, Yiran Wang, Ziwei Chai, Xiaotian Han, Boyi Liu, Jianbo Yuan, Jing Su, Guoyin Wang, Tingkai Liu, Liyu Chen, Tianyi Liu, Tao Sun, Yufeng Zhang, Sirui Zheng, Quanzeng You, Yang Yang, Hongxia Yang

CIBench: Evaluating Your LLMs with a Code Interpreter Plugin

While LLM-based agents, which use external tools to solve complex problems, have made significant progress, benchmarking their ability is challenging, thereby hindering a clear understanding of their limitations. In this paper, we propose…

Computation and Language · Computer Science 2024-11-07 Chuyu Zhang, Songyang Zhang, Yingfan Hu, Haowen Shen, Kuikun Liu, Zerun Ma, Fengzhe Zhou, Wenwei Zhang, Xuming He, Dahua Lin, Kai Chen

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation

Large Language Models (LLMs) have demonstrated remarkable performance on assisting humans in programming and facilitating programming automation. However, existing benchmarks for evaluating the code understanding and generation capacities…

Computation and Language · Computer Science 2024-06-10 Weixiang Yan, Haitian Liu, Yunkun Wang, Yunzhe Li, Qian Chen, Wen Wang, Tingyu Lin, Weishan Zhao, Li Zhu, Hari Sundaram, Shuiguang Deng

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of…

Computation and Language · Computer Science 2025-02-25 Alexander Zhang, Marcus Dong, Jiaheng Liu, Wei Zhang, Yejie Wang, Jian Yang, Ge Zhang, Tianyu Liu, Zhongyuan Peng, Yingshui Tan, Yuanxing Zhang, Zhexu Wang, Weixun Wang, Yancheng He, Ken Deng, Wangchunshu Zhou, Wenhao Huang, Zhaoxiang Zhang

EnStack: An Ensemble Stacking Framework of Large Language Models for Enhanced Vulnerability Detection in Source Code

Automated detection of software vulnerabilities is critical for enhancing security, yet existing methods often struggle with the complexity and diversity of modern codebases. In this paper, we introduce EnStack, a novel ensemble stacking…

Software Engineering · Computer Science 2024-11-26 Shahriyar Zaman Ridoy, Md. Shazzad Hossain Shaon, Alfredo Cuzzocrea, Mst Shapna Akter

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu