Related papers: ConvCodeWorld: Benchmarking Conversational Code Ge…

QCoder Benchmark: Bridging Language Generation and Quantum Hardware through Simulator-Based Feedback

Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it…

Computation and Language · Computer Science 2025-11-04 Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, Hiroya Takamura

RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing works largely adopt one-shot settings, ignoring the…

Computation and Language · Computer Science 2025-10-27 Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Yinghui Li, Hai-Tao Zheng, Xue Liu, Irwin King, Philip S. Yu

EvoCodeBench: An Evolving Code Generation Benchmark Aligned with Real-World Code Repositories

How to evaluate Large Language Models (LLMs) in code generation is an open question. Existing benchmarks demonstrate poor alignment with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper…

Computation and Language · Computer Science 2024-04-02 Jia Li, Ge Li, Xuanming Zhang, Yihong Dong, Zhi Jin

FeedbackEval: A Benchmark for Evaluating Large Language Models in Feedback-Driven Code Repair Tasks

Code repair is a fundamental task in software development, facilitating efficient bug resolution and software maintenance. Although large language models (LLMs) have demonstrated considerable potential in automated code repair, their…

Software Engineering · Computer Science 2026-02-27 Dekun Dai, MingWei Liu, Anji Li, Jialun Cao, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

Can Language Models Replace Programmers for Coding? REPOCOD Says 'Not Yet'

Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science 2025-06-26 Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan

Turning the Tide: Repository-based Code Reflection

Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, improving development efficiency and…

Software Engineering · Computer Science 2025-07-15 Wei Zhang, Jian Yang, Jiaxi Yang, Ya Wang, Zhoujun Li, Zeyu Cui, Binyuan Hui, Junyang Lin

CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the…

Software Engineering · Computer Science 2025-11-25 Peiding Wang, Li Zhang, Fang Liu, Lin Shi, Minxiao Li, Bo Shen, An Fu

When Benchmarks Talk: Re-Evaluating Code LLMs with Interactive Feedback

Programming is a fundamentally interactive process, yet coding assistants are often evaluated using static benchmarks that fail to measure how well models collaborate with users. We introduce an interactive evaluation pipeline to examine…

Human-Computer Interaction · Computer Science 2025-02-26 Jane Pan, Ryan Shar, Jacob Pfau, Ameet Talwalkar, He He, Valerie Chen

DevEval: A Manually-Annotated Code Generation Benchmark Aligned with Real-World Code Repositories

How to evaluate the coding abilities of Large Language Models (LLMs) remains an open question. We find that existing benchmarks are poorly aligned with real-world code repositories and are insufficient to evaluate the coding abilities of…

Computation and Language · Computer Science 2024-05-31 Jia Li, Ge Li, Yunfei Zhao, Yongmin Li, Huanyu Liu, Hao Zhu, Lecheng Wang, Kaibo Liu, Zheng Fang, Lanshen Wang, Jiazheng Ding, Xuanming Zhang, Yuqi Zhu, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li

PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback

Large Language Models (LLMs) are widely adopted for assisting in software development tasks, yet their performance evaluations have narrowly focused on the functional correctness of generated code. Human programmers, however, require…

Software Engineering · Computer Science 2024-12-06 Yun Peng, Akhilesh Deepak Gotmare, Michael Lyu, Caiming Xiong, Silvio Savarese, Doyen Sahoo

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou, Ao Liu, Yuchi Deng, Zhiying Zeng, Tao Zhang, Haotian Zhu, Jianwei Cai, Yue Mao, Chenchen Zhang, Lingyun Tan, Ziyan Xu, Bohui Zhai, Hengyi Liu, Speed Zhu, Wiggin Zhou, Fengzong Lian

Benchmarking and Revisiting Code Generation Assessment: A Mutation-Based Approach

Code Large Language Models (CLLMs) have exhibited outstanding performance in program synthesis, attracting the focus of the research community. The evaluation of CLLMs' program synthesis capability has generally relied on manually curated…

Software Engineering · Computer Science 2025-05-13 Longtian Wang, Tianlin Li, Xiaofei Xie, Yuhan Zhi, Jian Wang, Chao Shen

ConvBench: A Multi-Turn Conversation Evaluation Benchmark with Hierarchical Capability for Large Vision-Language Models

This paper presents ConvBench, a novel multi-turn conversation evaluation benchmark tailored for Large Vision-Language Models (LVLMs). Unlike existing benchmarks that assess individual capabilities in single-turn dialogues, ConvBench adopts…

Multimedia · Computer Science 2024-04-26 Shuo Liu, Kaining Ying, Hao Zhang, Yue Yang, Yuqi Lin, Tianle Zhang, Chuanhao Li, Yu Qiao, Ping Luo, Wenqi Shao, Kaipeng Zhang

Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development in which multimodal large language…

Computation and Language · Computer Science 2025-02-11 Chenglei Si, Yanzhe Zhang, Ryan Li, Zhengyuan Yang, Ruibo Liu, Diyi Yang

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with Conv-to-Bench, a…

Computation and Language · Computer Science 2026-05-27 Victor M. dos Santos, Andre C. Castro, Samuel L. de S. Toledo, Bruno M. L. Calura, Lisandra C. de M. Menezes, Raul C. R. Mata, Telma W. de L. Soares, Bryan L. M. de Oliveira

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong

An Empirical Study of Interaction Smells in Multi-Turn Human-LLM Collaborative Code Generation

Large Language Models (LLMs) have revolutionized code generation, evolving from static tools into dynamic conversational interfaces that facilitate complex, multi-turn collaborative programming. While LLMs exhibit remarkable proficiency in…

Software Engineering · Computer Science 2026-03-31 Binquan Zhang, Li Zhang, Lin Shi, Song Wang, Yuwei Qian, Linhui Zhao, Fang Liu, An Fu, Yida Ye

WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics

Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging…

Software Engineering · Computer Science 2026-03-17 Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li