Related papers: ProBench: Benchmarking Large Language Models in Co…

200 papers

Recent advancements in large language models (LLMs) have demonstrated significant progress in math and code reasoning capabilities. However, existing code benchmarks are limited in their ability to evaluate the full spectrum of these…

Computation and Language · Computer Science 2026-03-03 Zhexu Wang, Yiping Liu, Yejie Wang, Wenyang He, Bofei Gao, Muxi Diao, Yanxu Chen, Kelin Fu, Flood Sung, Zhilin Yang, Tianyu Liu, Weiran Xu

With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real…

Computation and Language · Computer Science 2025-06-06 Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we…

Artificial Intelligence · Computer Science 2025-09-16 Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation…

Software Engineering · Computer Science 2024-06-07 Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, Ion Stoica

Context: Due to the demand for strong algorithmic reasoning, complex logic implementation, and strict adherence to input/output formats and resource constraints, competitive programming generation by large language models (LLMs) is…

Social and Information Networks · Computer Science 2025-07-01 Minnan Wei, Ziming Li, Xiang Chen, Menglin Zheng, Ziyan Qu, Cheng Yu, Siyu Chen, Xiaolin Ju

We introduce seqBench, a parametrized benchmark for probing sequential reasoning limits in Large Language Models (LLMs) through precise, multi-dimensional control over several key complexity dimensions. seqBench allows systematic variation…

Artificial Intelligence · Computer Science 2025-09-23 Mohammad Ramezanali, Mo Vazifeh, Paolo Santi

Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by…

Computation and Language · Computer Science 2025-08-15 Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan

Competitive programming problems increasingly serve as valuable benchmarks to evaluate the coding capabilities of large language models (LLMs) due to their complexity and ease of verification. Yet, current coding benchmarks face limitations…

Artificial Intelligence · Computer Science 2025-12-23 Kaijian Zou, Aaron Xiong, Yunxiang Zhang, Frederick Zhang, Yueqi Ren, Jirong Yang, Ayoung Lee, Shitanshu Bhushan, Lu Wang

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the…

Software Engineering · Computer Science 2024-11-15 Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, Hongxia Yang

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman, Mohammad Mahoor

Large language models (LLMs) have shown promise in transforming machine learning research, yet their capability to faithfully implement novel ideas from recent research papers (ideas unseen during pretraining) remains unclear. We introduce…

Artificial Intelligence · Computer Science 2025-06-04 Tianyu Hua, Harper Hua, Violet Xiang, Benjamin Klieger, Sang T. Truong, Weixin Liang, Fan-Yun Sun, Nick Haber

Despite the remarkable advancements and widespread applications of deep neural networks, their ability to perform reasoning tasks remains limited, particularly in domains requiring structured, abstract thought. In this paper, we investigate…

Computation and Language · Computer Science 2025-09-16 Satyam Goyal, Soham Dan

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and…

Machine Learning · Computer Science 2025-05-09 Manik Sheokand, Parth Sawant

Modern Large Language Models (LLMs) have shown astounding capabilities of code understanding and synthesis. In order to assess such capabilities, several benchmarks have been devised (e.g., HumanEval). However, most benchmarks focus on code…

Software Engineering · Computer Science 2025-03-07 Julian Aron Prenner, Romain Robbes

With the increasing code reasoning capabilities of existing large language models (LLMs) and breakthroughs in reasoning models like OpenAI o1 and o3, there is a growing need to develop more challenging and comprehensive benchmarks that…

Large language models (LLMs) have shown strong performance on mathematical reasoning under well-defined conditions. However, real-world engineering problems involve uncertainty, context, and open-ended settings that extend beyond symbolic…

Artificial Intelligence · Computer Science 2026-05-05 Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly…

Computation and Language · Computer Science 2024-06-18 Yuqing Wang, Yun Zhao

As the mathematical capabilities of large language models (LLMs) improve, it becomes increasingly important to evaluate their performance on research-level tasks at the frontier of mathematical knowledge. However, existing benchmarks are…
