Related papers: CursorCore: Assist Programming through Aligning An…

Can Large Language Models Write Parallel Code?

Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-05-15 Daniel Nichols , Joshua H. Davis , Zhaojun Xie , Arjun Rajaram , Abhinav Bhatele

CORE: Collaborative Reasoning via Cross Teaching

Large language models exhibit complementary reasoning errors: on the same instance, one model may succeed with a particular decomposition while another fails. We propose Collaborative Reasoning (CORE), a training-time collaboration…

Artificial Intelligence · Computer Science 2026-01-30 Kshitij Mishra , Mirat Aubakirov , Martin Takac , Nils Lukas , Salem Lahlou

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Programming assistants powered by large language models have improved dramatically, yet existing benchmarks still evaluate them in narrow code-generation settings. Recent efforts such as InfiBench and StackEval rely on Stack Overflow…

Software Engineering · Computer Science 2026-01-16 Myeongsoo Kim , Shweta Garg , Baishakhi Ray , Varun Kumar , Anoop Deoras

INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models

Instruction-tuned large language models have revolutionized natural language processing and have shown great potential in applications such as conversational agents. These models, such as GPT-4, can not only master language but also solve…

Computation and Language · Computer Science 2023-06-16 Yew Ken Chia , Pengfei Hong , Lidong Bing , Soujanya Poria

CRUXEval-X: A Benchmark for Multilingual Code Reasoning, Understanding and Execution

Code benchmarks such as HumanEval are widely adopted to evaluate Large Language Models' (LLMs) coding capabilities. However, there is an unignorable programming language bias in existing code benchmarks -- over 95% code generation…

Artificial Intelligence · Computer Science 2025-05-20 Ruiyang Xu , Jialun Cao , Yaojie Lu , Ming Wen , Hongyu Lin , Xianpei Han , Ben He , Shing-Chi Cheung , Le Sun

CodeEval: A pedagogical approach for targeted evaluation of code-trained Large Language Models

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science 2026-01-08 Danny Brahman , Mohammad Mahoor

CodeAlignBench: Assessing Code Generation Models on Developer-Preferred Code Adjustments

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian , Ryan Shar , James R. Rae , Alireza Hashemi

CodeShell Technical Report

Code large language models mark a pivotal breakthrough in artificial intelligence. They are specifically crafted to understand and generate programming languages, significantly boosting the efficiency of coding development workflows. In…

Software Engineering · Computer Science 2024-03-26 Rui Xie , Zhengran Zeng , Zhuohao Yu , Chang Gao , Shikun Zhang , Wei Ye

CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models

Code generation models based on the pre-training and fine-tuning paradigm have been increasingly attempted by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To evaluate the…

Software Engineering · Computer Science 2024-02-26 Hao Yu , Bo Shen , Dezhi Ran , Jiaxin Zhang , Qi Zhang , Yuchi Ma , Guangtai Liang , Ying Li , Qianxiang Wang , Tao Xie

Design of AI-Powered Tool for Self-Regulation Support in Programming Education

Large Language Model (LLM) tools have demonstrated their potential to deliver high-quality assistance by providing instant, personalized feedback that is crucial for effective programming education. However, many of these tools operate…

Human-Computer Interaction · Computer Science 2025-04-08 Huiyong Li , Boxuan Ma

CREF: An LLM-based Conversational Software Repair Framework for Programming Tutors

Program repair techniques offer cost-saving benefits for debugging within software development and programming education scenarios. With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, researchers have…

Software Engineering · Computer Science 2024-07-09 Boyang Yang , Haoye Tian , Weiguo Pian , Haoran Yu , Haitao Wang , Jacques Klein , Tegawendé F. Bissyandé , Shunfu Jin

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation.…

Software Engineering · Computer Science 2021-03-17 Shuai Lu , Daya Guo , Shuo Ren , Junjie Huang , Alexey Svyatkovskiy , Ambrosio Blanco , Colin Clement , Dawn Drain , Daxin Jiang , Duyu Tang , Ge Li , Lidong Zhou , Linjun Shou , Long Zhou , Michele Tufano , Ming Gong , Ming Zhou , Nan Duan , Neel Sundaresan , Shao Kun Deng , Shengyu Fu , Shujie Liu

ProgCo: Program Helps Self-Correction of Large Language Models

Self-Correction aims to enable large language models (LLMs) to self-verify and self-refine their initial responses without external feedback. However, LLMs often fail to effectively self-verify and generate correct feedback, further…

Computation and Language · Computer Science 2025-05-28 Xiaoshuai Song , Yanan Wu , Weixun Wang , Jiaheng Liu , Wenbo Su , Bo Zheng

FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks

Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their…

Software Engineering · Computer Science 2026-02-27 Dekun Dai , MingWei Liu , Anji Li , Jialun Cao , Yanlin Wang , Chong Wang , Xin Peng , Zibin Zheng

Competition-Level Code Generation with AlphaCode

Programming is a powerful and ubiquitous problem-solving tool. Developing systems that can assist programmers or even generate programs independently could make programming more productive and accessible, yet so far incorporating…

Programming Languages · Computer Science 2023-01-11 Yujia Li , David Choi , Junyoung Chung , Nate Kushman , Julian Schrittwieser , Rémi Leblond , Tom Eccles , James Keeling , Felix Gimeno , Agustin Dal Lago , Thomas Hubert , Peter Choy , Cyprien de Masson d'Autume , Igor Babuschkin , Xinyun Chen , Po-Sen Huang , Johannes Welbl , Sven Gowal , Alexey Cherepanov , James Molloy , Daniel J. Mankowitz , Esme Sutherland Robson , Pushmeet Kohli , Nando de Freitas , Koray Kavukcuoglu , Oriol Vinyals

CoRe: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks

Large language models (LLMs) have been widely adopted across diverse domains of software engineering, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code…

Software Engineering · Computer Science 2026-01-21 Danning Xie , Mingwei Zheng , Xuwei Liu , Jiannan Wang , Chengpeng Wang , Lin Tan , Xiangyu Zhang

A Tool for In-depth Analysis of Code Execution Reasoning of Large Language Models

Code Executing Reasoning is becoming a new non-functional metric that assesses the ability of large language models (LLMs) in programming tasks. State-of-the-art frameworks (CodeMind or REval) and benchmarks (CruxEval) usually focus on…

Software Engineering · Computer Science 2025-01-31 Changshu Liu , Reyhaneh Jabbarvand

Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization

Language models are now prevalent in software engineering with many developers using them to automate tasks and accelerate their development. While language models have been tremendous at accomplishing complex software engineering tasks,…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-21 Daniel Nichols , Konstantinos Parasyris , Charles Jekel , Abhinav Bhatele , Harshitha Menon

Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences

Large language models make remarkable progress in reasoning capabilities. Existing works focus mainly on deductive reasoning tasks (e.g., code and math), while another type of reasoning mode that better aligns with human learning, inductive…

Computation and Language · Computer Science 2025-03-18 Kedi Chen , Zhikai Lei , Fan Zhang , Yinqi Zhang , Qin Chen , Jie Zhou , Liang He , Qipeng Guo , Kai Chen , Wei Zhang

CREval: An Automated Interpretable Evaluation for Creative Image Manipulation under Complex Instructions

Instruction-based multimodal image manipulation has recently made rapid progress. However, existing evaluation methods lack a systematic and human-aligned framework for assessing model performance on complex and creative editing tasks. To…

Computer Vision and Pattern Recognition · Computer Science 2026-03-30 Chonghuinan Wang , Zihan Chen , Yuxiang Wei , Tianyi Jiang , Xiaohe Wu , Fan Li , Wangmeng Zuo , Hongxun Yao