Related papers: BigCodeBench: Benchmarking Code Generation with Di…

200 papers

Large language models (LLMs) can often generate functionally correct code, but their ability to produce efficient implementations for performance-critical systems tasks remains limited. Existing code benchmarks mainly emphasize correctness…

Software Engineering · Computer Science 2026-05-18 Huihao Jing, Wenbin Hu, Haochen Shi, Hanyu Yang, Sirui Zhang, Shaojin Chen, Haoran Li, Yangqiu Song

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains, with code generation emerging as a key area of focus. While numerous benchmarks have been proposed to evaluate their code generation abilities,…

Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…

Software Engineering · Computer Science 2025-06-19 Hongda Zhu, Yiwen Zhang, Bing Zhao, Jingzhe Ding, Siyao Liu, Tong Liu, Dandan Wang, Yanan Liu, Zhaojian Li

Large Language Models (LLMs) have achieved remarkable success in code generation tasks, powering various applications like code completion, debugging, and programming assistance. However, existing benchmarks such as HumanEval, MBPP, and…

Machine Learning · Computer Science 2025-05-09 Manik Sheokand, Parth Sawant

Large Language Models (LLMs) have demonstrated their remarkable capabilities in numerous fields. This survey focuses on how LLMs empower users, regardless of their technical background, to use human languages to automatically generate…

Software Engineering · Computer Science 2025-04-03 Nam Huynh, Beiyu Lin

Code-mixing, the practice of switching between languages within a conversation, poses unique challenges for traditional NLP. Existing benchmarks are limited by their narrow language pairs and tasks, failing to adequately assess large…

Computation and Language · Computer Science 2025-09-09 Yilun Yang, Yekun Chai

Enhancing large language models (LLMs) with real-time APIs can help generate more accurate and up-to-date responses. However, evaluating the function calling abilities of LLMs in real-world scenarios remains under-explored due to the…

Computation and Language · Computer Science 2025-01-20 Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, Jie Tang

Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This…

Programming Languages · Computer Science 2025-05-30 Thanh Le-Cong, Bach Le, Toby Murray

Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the…

Software Engineering · Computer Science 2024-11-15 Linyi Li, Shijie Geng, Zhenwen Li, Yibo He, Hao Yu, Ziyue Hua, Guanghan Ning, Siwei Wang, Tao Xie, Hongxia Yang

The LLM Agent, equipped with a code interpreter, is capable of automatically solving real-world coding tasks, such as data analysis and image editing. However, existing benchmarks primarily focus on either simplistic tasks, such as…

Software Engineering · Computer Science 2024-08-06 Yaolun Zhang, Yinxu Pan, Yudong Wang, Jie Cai

While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally,…

Computation and Language · Computer Science 2025-06-12 Wei Zhang, Zhenhong Zhou, Kun Wang, Junfeng Fang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xavier Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su

Large language models (LLMs) play a crucial role in software engineering, excelling in tasks like code generation and maintenance. However, existing benchmarks are often narrow in scope, focusing on a specific task and lacking a comprehensive…

Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve…

Cryptography and Security · Computer Science 2025-06-02 Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev

As large language models (LLMs) become integral to code-related tasks, a central question emerges: Do LLMs truly understand program semantics? We introduce EquiBench, a new benchmark for evaluating LLMs through equivalence checking, i.e.,…

Machine Learning · Computer Science 2025-09-23 Anjiang Wei, Jiannan Cao, Ran Li, Hongyu Chen, Yuhui Zhang, Ziheng Wang, Yuan Liu, Thiago S. F. X. Teixeira, Diyi Yang, Ke Wang, Alex Aiken

Natural language-driven no-code development allows users to specify software functionality using natural language (NL) instead of editing source code, promising increased productivity and democratized development. Large language models…

Software Engineering · Computer Science 2025-08-19 Le Deng, Zhonghao Jiang, Jialun Cao, Michael Pradel, Zhongxin Liu

We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. DSCodeBench consists of 1,000 carefully constructed problems sourced from realistic…

Software Engineering · Computer Science 2025-11-18 Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, Jie M. Zhang

The critique capacity of Large Language Models (LLMs) is essential for reasoning abilities, which can provide necessary suggestions (e.g., detailed analysis and constructive feedback). Therefore, how to evaluate the critique capacity of…

In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute…

Computation and Language · Computer Science 2024-11-04 Yongliang Shen, Kaitao Song, Xu Tan, Wenqi Zhang, Kan Ren, Siyu Yuan, Weiming Lu, Dongsheng Li, Yueting Zhuang

Can the rapid advances in code generation, function calling, and data analysis using large language models (LLMs) help automate the search and verification of hypotheses purely from a set of provided datasets? To evaluate this question, we…

Large Language Models (LLMs) have achieved remarkable success in code generation, and the race to improve their performance has become a central focus of AI research. Benchmarks and leaderboards are increasingly popular, offering…

Software Engineering · Computer Science 2025-11-07 Amir Molzam Sharifloo, Maedeh Heydari, Parsa Kazerooni, Daniel Maninger, Mira Mezini