Related papers: WebCoderBench: Benchmarking Web Application Genera…

FrontendBench: A Benchmark for Evaluating LLMs on Front-End Development via Automatic Evaluation

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu , Yiwen Zhang , Bing Zhao , Jingzhe Ding , Siyao Liu , Tong Liu , Dandan Wang , Yanan Liu , Zhaojian Li

WebTestBench: Evaluating Computer-Use Agents towards End-to-End Automated Web Testing

The emergence of Large Language Models (LLMs) has catalyzed a paradigm shift in programming, giving rise to "vibe coding", where users can build complete projects and even control computers using natural language instructions. This paradigm…

Software Engineering · Computer Science 2026-03-27 Fanheng Kong , Jingyuan Zhang , Yang Yue , Chenxi Sun , Yang Tian , Shi Feng , Xiaocui Yang , Daling Wang , Yu Tian , Jun Du , Wenchong Zeng , Han Li , Kun Gai

WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code

With the rapid advancement of Generative AI technology, Multimodal Large Language Models(MLLMs) have the potential to act as AI software engineers capable of executing complex web application development. Considering that the model requires…

Computation and Language · Computer Science 2025-06-10 Zhiyu Lin , Zhengda Zhou , Zhiyuan Zhao , Tianrui Wan , Yilun Ma , Junyu Gao , Xuelong Li

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan , Pareesa Ameneh Golnari , Wen Wen , Xiaoyu Liu , Gabriel Ryan , Yuting Sun , Shengyu Fu , Elsie Nallipogu

DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the…

Software Engineering · Computer Science 2026-03-17 Jingyu Xiao , Ming Wang , Man Ho Lam , Yuxuan Wan , Junliang Liu , Yintong Huo , Michael R. Lyu

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing , Wenbin Hu , Haochen Shi , Hanyu Yang , Sirui Zhang , Shaojin Chen , Haoran Li , Yangqiu Song

WebApp1K: A Practical Code-Generation Benchmark for Web App Development

We introduce WebApp1K, a practical code-generation benchmark to measure LLM ability to develop web apps. This benchmark aims to calibrate LLM output and aid the models to progressively improve code correctness and functionality. The…

Software Engineering · Computer Science 2024-08-02 Yi Cui

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li , Hongyi Deng , Yiran Zhang , Kechi Zhang , Tianqi Shao , Tiankuo Zhao , Weinan Wang , Zhi Jin , Ge Li , Yang Liu , Yingtao Fang , Yihong Dong

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Software Engineering · Computer Science 2025-04-02 Terry Yue Zhuo , Minh Chien Vu , Jenny Chim , Han Hu , Wenhao Yu , Ratnadira Widyasari , Imam Nur Bani Yusuf , Haolan Zhan , Junda He , Indraneil Paul , Simon Brunner , Chen Gong , Thong Hoang , Armel Randy Zebaze , Xiaoheng Hong , Wen-Ding Li , Jean Kaddour , Ming Xu , Zhihan Zhang , Prateek Yadav , Naman Jain , Alex Gu , Zhoujun Cheng , Jiawei Liu , Qian Liu , Zijian Wang , Binyuan Hui , Niklas Muennighoff , David Lo , Daniel Fried , Xiaoning Du , Harm de Vries , Leandro Von Werra

Insights from Benchmarking Frontier Language Models on Web App Code Generation

This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models…

Software Engineering · Computer Science 2024-09-10 Yi Cui

WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning

Automating the conversion of UI images into web code is a critical task for front-end development and rapid prototyping. Advances in multimodal large language models (MLLMs) have made WebUI-to-Code increasingly feasible, yet existing…

Artificial Intelligence · Computer Science 2025-10-10 Peichao Lai , Jinhui Zhuang , Kexuan Zhang , Ningchang Xiong , Shengjie Wang , Yanwei Xu , Chong Chen , Yilei Wang , Bin Cui

Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks

The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks…

Artificial Intelligence · Computer Science 2025-05-13 Kai Xu , YiWei Mao , XinYi Guan , ZiLong Feng

MermaidSeqBench: An Evaluation Benchmark for NL-to-Mermaid Sequence Diagram Generation

Large language models (LLMs) have shown great promise in generating structured diagrams from natural language descriptions, particularly Mermaid sequence diagrams for software engineering. However, the lack of existing benchmarks to assess…

Software Engineering · Computer Science 2026-04-28 Basel Shbita , Farhan Ahmed , Chad DeLuca

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Large Language Models (LLMs) applied to code-related applications have emerged as a prominent field, attracting significant interest from both academia and industry. However, as new and improved LLMs are developed, existing evaluation…

Software Engineering · Computer Science 2024-06-07 Naman Jain , King Han , Alex Gu , Wen-Ding Li , Fanjia Yan , Tianjun Zhang , Sida Wang , Armando Solar-Lezama , Koushik Sen , Ion Stoica

Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review

With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to…

Artificial Intelligence · Computer Science 2024-06-19 Debalina Ghosh Paul , Hong Zhu , Ian Bayley

Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in its automation. However, existing benchmarks for LLM-based code review face three major limitations. Lack…

Software Engineering · Computer Science 2026-01-01 Ruida Hu , Xinchen Wang , Xin-Cheng Wen , Zhao Zhang , Bo Jiang , Pengfei Gao , Chao Peng , Cuiyun Gao

Large Language Models for Code Generation: The Practitioners Perspective

Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry based projects are…

Software Engineering · Computer Science 2025-01-29 Zeeshan Rasheed , Muhammad Waseem , Kai Kristian Kemell , Aakash Ahmad , Malik Abdul Sami , Jussi Rasku , Kari Systä , Pekka Abrahamsson

Top General Performance = Top Domain Performance? DomainCodeBench: A Multi-domain Code Generation Benchmark

With the rapid advancement of large language models (LLMs), extensive research has been conducted to investigate the code generation capabilities of LLMs. However, existing efforts primarily focus on general-domain tasks, leaving LLMs' code…

Software Engineering · Computer Science 2025-03-18 Dewu Zheng , Yanlin Wang , Ensheng Shi , Xilin Liu , Yuchi Ma , Hongyu Zhang , Zibin Zheng

PyBench: Evaluating LLM Agent on various real-world coding tasks

The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as…

Software Engineering · Computer Science 2024-08-06 Yaolun Zhang , Yinxu Pan , Yudong Wang , Jie Cai

AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Computation and Language · Computer Science 2025-08-13 Jason Chou , Ao Liu , Yuchi Deng , Zhiying Zeng , Tao Zhang , Haotian Zhu , Jianwei Cai , Yue Mao , Chenchen Zhang , Lingyun Tan , Ziyan Xu , Bohui Zhai , Hengyi Liu , Speed Zhu , Wiggin Zhou , Fengzong Lian