Related papers: Top General Performance = Top Domain Performance? …

200 papers

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on…

Artificial Intelligence · Computer Science 2024-08-26 Qiming Zhu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real…

Machine Learning · Computer Science 2026-05-19 Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing…

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, with tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…

Large language models (LLMs) such as ChatGPT have shown remarkable capabilities in code generation. Despite significant achievements, they rely on enormous training data to acquire a broad spectrum of open-domain knowledge. Besides, their…

Software Engineering · Computer Science 2025-02-18 Xiaodong Gu, Meng Chen, Yalan Lin, Yuhan Hu, Hongyu Zhang, Chengcheng Wan, Zhao Wei, Yong Xu, Juhong Wang

How to evaluate Large Language Models (LLMs) in code generation remains an open question. Existing benchmarks have two limitations - data leakage and lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the…

Computation and Language · Computer Science 2024-10-31 Jia Li, Ge Li, Xuanming Zhang, Yunfei Zhao, Yihong Dong, Zhi Jin, Binhua Li, Fei Huang, Yongbin Li

Large language models (LLMs) excel at general programming but struggle with domain-specific software development, necessitating domain specialization methods for LLMs to learn and utilize domain knowledge and data. However, existing…

Software Engineering · Computer Science 2026-04-28 Xue Jiang, Ge Li, Jiaru Qian, Xianjie Shi, Chenjie Li, Hao Zhu, Ziyu Wang, Jielun Zhang, Zheyu Zhao, Lingwei Wu, Kechi Zhang, Jia Li, Wenpin Jiao, Zhi Jin, Yihong Dong

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and…

Machine Learning · Computer Science 2025-05-09 Manik Sheokand, Parth Sawant

Code-mixing, the practice of switching between languages within a conversation, poses unique challenges for traditional NLP. Existing benchmarks are limited by their narrow language pairs and tasks, failing to adequately assess large…

Computation and Language · Computer Science 2025-09-09 Yilun Yang, Yekun Chai

As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…

Software Engineering · Computer Science 2025-11-03 Forough Mehralian, Ryan Shar, James R. Rae, Alireza Hashemi

Large language models (LLMs) can generate code from natural language, but the extent to which they capture intended program behavior remains unclear. Executable behavioral specifications, defined via preconditions and postconditions,…

Software Engineering · Computer Science 2026-04-15 Zaoyu Chen, Jianbo Dai, Boyu Zhu, Jingdong Wang, Huiming Wang, Xin Xu, Haoyang Yuan, Zhijiang Guo, Xiao-Ming Wu

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the…

Software Engineering · Computer Science 2026-03-17 Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress using Large Language Models (LLMs) for code generation. Many benchmarks like HumanEval and…

Software Engineering · Computer Science 2026-04-27 Jia Li, Hongyi Deng, Yiran Zhang, Kechi Zhang, Tianqi Shao, Tiankuo Zhao, Weinan Wang, Zhi Jin, Ge Li, Yang Liu, Yingtao Fang, Yihong Dong

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lack a comprehensive…

Recent advancements in large language models (LLMs) have significantly enhanced text generation capabilities, yet evaluating their performance in generative writing remains a challenge. Existing benchmarks primarily focus on generic text…

Artificial Intelligence · Computer Science 2025-12-01 Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue Wu, Qin Jin, Fei Huang

We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. DSCodeBench consists of 1,000 carefully constructed problems sourced from realistic…

Software Engineering · Computer Science 2025-11-18 Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, Jie M. Zhang

Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging…

Software Engineering · Computer Science 2026-03-17 Chenxu Liu, Yingjie Fu, Wei Yang, Ying Zhang, Tao Xie