Related papers: PECC: Problem Extraction and Coding Challenges

PEA: Enhancing LLM Performance on Computational-Reasoning Tasks

Large Language Models (LLMs) have exhibited remarkable capabilities across diverse domains, prompting investigations into their potential as generic reasoning engines. While recent studies have explored inference-time computation to enhance…

Artificial Intelligence · Computer Science 2025-02-18 Zi Wang, Shiwei Weng, Mohannad Alhanahnah, Somesh Jha, Tom Reps

Navigating the Labyrinth: Evaluating LLMs' Ability to Reason About Search Problems

Large Language Models (LLMs) have recently achieved impressive performance in math and reasoning benchmarks. However, they often struggle with logic problems and puzzles that are relatively easy for humans. To further investigate this, we…

Artificial Intelligence · Computer Science 2025-09-16 Nasim Borazjanizadeh, Roei Herzig, Trevor Darrell, Rogerio Feris, Leonid Karlinsky

ICPC-Eval: Probing the Frontiers of LLM Reasoning with Competitive Programming Contests

With the significant progress of large reasoning models in complex coding and reasoning tasks, existing benchmarks, like LiveCodeBench and CodeElo, are insufficient to evaluate the coding capabilities of large language models (LLMs) in real…

Computation and Language · Computer Science 2025-06-06 Shiyi Xu, Yiwen Hu, Yingqian Min, Zhipeng Chen, Wayne Xin Zhao, Ji-Rong Wen

LLM-ProS: Analyzing Large Language Models' Performance in Competitive Problem Solving

The rapid advancement of large language models has opened new avenues for automating complex problem-solving tasks such as algorithmic coding and competitive programming. This paper introduces a novel evaluation technique, LLM-ProS, to…

Computation and Language · Computer Science 2026-03-03 Md Sifat Hossain, Anika Tabassum, Md. Fahim Arefin, Tarannum Shaila Zaman

AetherCode: Evaluating LLMs' Ability to Win In Premier Programming Competitions

Competitive programming has emerged as a critical benchmark for evaluating the reasoning and coding capabilities of Large Language Models (LLMs). Despite impressive progress on existing benchmarks, we argue that current evaluations…

Software Engineering · Computer Science 2025-08-25 Zihan Wang, Jiaze Chen, Zhicheng Liu, Markus Mak, Yidi Du, Geonsik Moon, Luoqi Xu, Aaron Tua, Kunshuo Peng, Jiayi Lu, Mingfei Xia, Boqian Zou, Chenyang Ran, Guang Tian, Shoutai Zhu, Yeheng Duan, Zhenghui Kang, Zhenxing Lin, Shangshu Li, Qiang Luo, Qingshen Long, Zhiyong Chen, Yihan Xiao, Yurong Wu, Daoguang Zan, Yuyi Fu, Mingxuan Wang, Ming Ding

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'

Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan

RealMath: A Continuous Benchmark for Evaluating Language Models on Research-Level Mathematics

Existing benchmarks for evaluating mathematical reasoning in large language models (LLMs) rely primarily on competition problems, formal proofs, or artificially challenging questions, failing to capture the nature of mathematics…

Artificial Intelligence · Computer Science 2025-10-21 Jie Zhang, Cezara Petrui, Kristina Nikolić, Florian Tramèr

Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models

The performance of large language models (LLMs) on existing reasoning benchmarks has significantly improved over the past years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem…

Computation and Language · Computer Science 2023-10-24 Daman Arora, Himanshu Gaurav Singh, Mausam

Evaluating and Improving Large Language Models for Competitive Program Generation

Context: Due to the demand for strong algorithmic reasoning, complex logic implementation, and strict adherence to input/output formats and resource constraints, competitive programming generation by large language models (LLMs) is…

Social and Information Networks · Computer Science 2025-07-01 Minnan Wei, Ziming Li, Xiang Chen, Menglin Zheng, Ziyan Qu, Cheng Yu, Siyu Chen, Xiaolin Ju

AECBench: A Hierarchical Benchmark for Knowledge Evaluation of Large Language Models in the AEC Field

Large language models (LLMs), as a novel information technology, are seeing increasing adoption in the Architecture, Engineering, and Construction (AEC) field. They have shown their potential to streamline processes throughout the building…

Computation and Language · Computer Science 2026-02-17 Chen Liang, Zhaoqi Huang, Haofen Wang, Fu Chai, Chunying Yu, Huanhuan Wei, Zhengjie Liu, Yanpeng Li, Hongjun Wang, Ruifeng Luo, Xianzhong Zhao

Through the Lens of Core Competency: Survey on Evaluation of Large Language Models

From pre-trained language model (PLM) to large language model (LLM), the field of natural language processing (NLP) has witnessed steep performance gains and wide practical uses. The evaluation of a research field guides its direction of…

Computation and Language · Computer Science 2023-08-16 Ziyu Zhuang, Qiguang Chen, Longxuan Ma, Mingda Li, Yi Han, Yushan Qian, Haopeng Bai, Zixian Feng, Weinan Zhang, Ting Liu

Beyond Accuracy: Characterizing Code Comprehension Capabilities in (Large) Language Models

Large Language Models (LLMs) are increasingly integrated into software engineering workflows, yet current benchmarks provide only coarse performance summaries that obscure the diverse capabilities and limitations of these models. This paper…

Software Engineering · Computer Science 2026-01-21 Felix Mächtle, Jan-Niclas Serr, Nils Loose, Thomas Eisenbarth

MathConstruct: Challenging LLM Reasoning with Constructive Proofs

While Large Language Models (LLMs) demonstrate impressive performance in mathematics, existing math benchmarks come with significant limitations. Many focus on problems with fixed ground-truth answers, and are often saturated due to problem…

Artificial Intelligence · Computer Science 2025-10-02 Mislav Balunović, Jasper Dekoninck, Nikola Jovanović, Ivo Petrov, Martin Vechev

PEAK: A Performance Engineering AI-Assistant for GPU Kernels Powered by Natural Language Transformations

Advancements in large language models (LLMs) are showing promising impact in software development and programming assistance. However, these models struggle when operating on low-level backend code. This challenge is exacerbated in the…

Software Engineering · Computer Science 2025-12-23 Muhammad Usman Tariq, Abhinav Jangda, Angelica Moreira, Madan Musuvathi, Tyler Sorensen

Large Language Models Struggle with Unreasonability in Math Problems

Large Language Models (LLMs) have shown remarkable success on a wide range of math and reasoning benchmarks. However, we observe that they often struggle when faced with unreasonable math problems. Instead of recognizing these issues,…

Computation and Language · Computer Science 2025-06-03 Jingyuan Ma, Damai Dai, Zihang Yuan, Rui li, Weilin Luo, Bin Wang, Qun Liu, Lei Sha, Zhifang Sui

Benchmarking Large Language Models with Integer Sequence Generation Tasks

We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced…

Machine Learning · Computer Science 2025-11-11 Daniel O'Malley, Manish Bhattarai, Nishath Rajiv Ranasinghe, Erick Draayer, Javier Santos

UGMathBench: A Diverse and Dynamic Benchmark for Undergraduate-Level Mathematical Reasoning with Large Language Models

Large Language Models (LLMs) have made significant strides in mathematical reasoning, underscoring the need for a comprehensive and fair evaluation of their capabilities. However, existing benchmarks often fall short, either lacking…

Computation and Language · Computer Science 2025-02-26 Xin Xu, Jiaxin Zhang, Tianhao Chen, Zitong Chao, Jishan Hu, Can Yang

EquiBench: Benchmarking Large Language Models' Reasoning about Program Semantics via Equivalence Checking

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

HPC-Coder: Modeling Parallel Programs using Large Language Models

Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models make developing, optimizing, and maintaining parallel software…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-15 Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Large Language Models for Code Generation: The Practitioners Perspective

Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry based projects are…

Software Engineering · Computer Science 2025-01-29 Zeeshan Rasheed, Muhammad Waseem, Kai Kristian Kemell, Aakash Ahmad, Malik Abdul Sami, Jussi Rasku, Kari Systä, Pekka Abrahamsson