Related papers: McEval: Massively Multilingual Code Evaluation

200 papers

Recently, pre-trained large language models (LLMs) have shown impressive abilities in generating codes from natural language descriptions, repairing buggy codes, translating codes between languages, and retrieving relevant code segments.…

Computation and Language · Computer Science · 2023-11-07 · Mohammad Abdullah Matin Khan, M Saiful Bari, Xuan Long Do, Weishi Wang, Md Rizwan Parvez, Shafiq Joty

Code large language models (LLMs) have made significant progress in code debugging by directly generating the correct code based on the buggy code snippet. Programming benchmarks, typically consisting of buggy code snippets and their…

Recent advancements in large language models (LLMs) have significantly enhanced code generation from natural language prompts. The HumanEval Benchmark, developed by OpenAI, remains the most widely used code generation benchmark. However,…

Computation and Language · Computer Science · 2025-05-19 · Nishat Raihan, Antonios Anastasopoulos, Marcos Zampieri

While large language models (LLMs) exhibit state-of-the-art performance in various tasks, recent studies have revealed their struggle for code translation. This is because they haven't been extensively pre-trained with parallel multilingual…

Software Engineering · Computer Science · 2024-10-15 · Qingxiao Tao, Tingrui Yu, Xiaodong Gu, Beijun Shen

Code benchmarks such as HumanEval are widely adopted to evaluate Large Language Models' (LLMs) coding capabilities. However, there is an unignorable programming language bias in existing code benchmarks -- over 95% code generation…

Artificial Intelligence · Computer Science · 2025-05-20 · Ruiyang Xu, Jialun Cao, Yaojie Lu, Ming Wen, Hongyu Lin, Xianpei Han, Ben He, Shing-Chi Cheung, Le Sun

Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…

Software Engineering · Computer Science · 2026-01-08 · Danny Brahman, Mohammad Mahoor

Large language models have demonstrated the ability to generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge…

Large language models (LLMs) have made significant progress in generating codes from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual codes or have been constrained to…

Computation and Language · Computer Science · 2024-03-26 · Qiwei Peng, Yekun Chai, Xuhong Li

The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities, yet existing benchmarks predominantly assess models at a single structural…

Computation and Language · Computer Science · 2025-12-30 · Fanglin Xu, Wei Zhang, Jian Yang, Guo Chen, Aishan Liu, Zhoujun Li, Xianglong Liu, Bryan Dai

Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…

Software Engineering · Computer Science · 2025-02-04 · Wenhan Wang, Chenyuan Yang, Zhijie Wang, Yuheng Huang, Zhaoyang Chu, Da Song, Lingming Zhang, An Ran Chen, Lei Ma

Existing code generation benchmarks for Large Language Models (LLMs) such as HumanEval and MBPP are designed to study LLMs' end-to-end performance, where the benchmarks feed a problem description in natural language as input and examine the…

Software Engineering · Computer Science · 2025-02-27 · Jiarong Wu, Songqiang Chen, Jialun Cao, Hau Ching Lo, Shing-Chi Cheung

In recent years, large language models (LLMs) have showcased significant advancements in code generation. However, most evaluation benchmarks are primarily oriented towards Python, making it difficult to evaluate other programming…

Machine Learning · Computer Science · 2025-06-02 · Ivan Petrukha, Yana Kurliak, Nataliia Stulova

Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capability on…

Artificial Intelligence · Computer Science · 2024-08-26 · Qiming Zhu, Jialun Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Shing-Chi Cheung

Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation,…

Distributed, Parallel, and Cluster Computing · Computer Science · 2024-05-15 · Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele

Recent advancements in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focused on simplified or isolated aspects of coding, such as single-file code generation…

Computation and Language · Computer Science · 2024-12-17 · Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, Zhiyin Yu, He Du, Ping Yang, Dahua Lin, Chao Peng, Kai Chen

Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner. Despite promising…

Repository-level code completion has drawn great attention in software engineering, and several benchmark datasets have been introduced. However, existing repository-level code completion benchmarks usually focus on a limited number of…

Computation and Language · Computer Science · 2024-10-29 · Jiaheng Liu, Ken Deng, Congnan Liu, Jian Yang, Shukai Liu, He Zhu, Peng Zhao, Linzheng Chai, Yanan Wu, Ke Jin, Ge Zhang, Zekun Wang, Guoan Zhang, Bangyu Xiang, Wenbo Su, Bo Zheng

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science · 2024-09-17 · Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia

Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…

Software Engineering · Computer Science · 2025-06-26 · Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan

Evaluating the performance of Code Language Models (CLMs) for software engineering tasks, especially in multilingual and low-resource programming language settings, poses significant challenges. These challenges are primarily due to the…

Software Engineering · Computer Science · 2024-11-26 · Rohit Dandamudi, Gema Rodríguez-Pérez