Related papers: ProBench: Benchmarking Large Language Models in Co…

OJBench: A Competition Level Code Benchmark For Large Language Models

Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these…

Computation and Language · Computer Science 2026-03-03 Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu

ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real…

Computation and Language · Computer Science 2025-06-06 Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we…

Artificial Intelligence · Computer Science 2025-09-16 Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation…

Software Engineering · Computer Science 2024-06-07 Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

Evaluating and Improving Large Language Models for Competitive Program Generation

Context: Due to the demand for strong algorithmic reasoning, complex logic implementation, and strict adherence to input/output formats and resource constraints, competitive programming generation by large language models (LLMs) is…

Social and Information Networks · Computer Science 2025-07-01 Minnan Wei, Ziming Li, Xiang Chen, Menglin Zheng, Ziyan Qu, Cheng Yu, Siyu Chen, Xiaolin Ju

seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs

We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation…

Artificial Intelligence · Computer Science 2025-09-23 Mohammad Ramezanali, Mo Vazifeh, Paolo Santi

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by…

Computation and Language · Computer Science 2025-08-15 Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan

LiveOIBench: Can Large Language Models Outperform Human Contestants in Informatics Olympiads?

Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations…

Artificial Intelligence · Computer Science 2025-12-23 Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the…

Software Engineering · Computer Science 2024-11-15 Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, Hongxia Yang

CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman, Mohammad Mahoor

ResearchCodeBench: Benchmarking LLMs on Implementing Novel Machine Learning Research Code

Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement novel ideas from recent research papers (ideas unseen during pretraining) remains unclear. We introduce…

Artificial Intelligence · Computer Science 2025-06-04 Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T. Truong, Weixin Liang, Fan-Yun Sun, Nick Haber

IOLBENCH: Benchmarking LLMs on Linguistic Reasoning

Despite the remarkable advancements and widespread applications of deep neural networks, their ability to perform reasoning tasks remains limited, particularly in domains requiring structured, abstract thought. In this paper, we investigate…

Computation and Language · Computer Science 2025-09-16 Satyam Goyal, Soham Dan

CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and…

Machine Learning · Computer Science 2025-05-09 Manik Sheokand, Parth Sawant

ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions

Modern Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis. In order to assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code…

Software Engineering · Computer Science 2025-03-07 Julian Aron Prenner, Romain Robbes

CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings

With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that…

Computation and Language · Computer Science 2025-01-06 Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan Hui, Junyang Lin

EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving

Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic…

Artificial Intelligence · Computer Science 2026-05-05 Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao

RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly…

Computation and Language · Computer Science 2024-06-18 Yuqing Wang, Yun Zhao

IMProofBench: Benchmarking AI on Research-Level Mathematical Proof Generation

As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are…

Computation and Language · Computer Science 2025-10-01 Johannes Schmitt, Gergely Bérczi, Jasper Dekoninck, Jeremy Feusi, Tim Gehrunger, Raphael Appenzeller, Jim Bryan, Niklas Canova, Timo de Wolff, Filippo Gaia, Michel van Garrel, Baran Hashemi, David Holmes, Aitor Iribar Lopez, Victor Jaeck, Martina Jørgensen, Steven Kelk, Stefan Kuhlmann, Adam Kurpisz, Chiara Meroni, Ingmar Metzler, Martin Möller, Samuel Muñoz-Echániz, Robert Nowak, Georg Oberdieck, Daniel Platt, Dylan Possamaï, Gabriel Ribeiro, Raúl Sánchez Galán, Zheming Sun, Josef Teichmann, Richard P. Thomas, Charles Vial