Related papers: OOP: Object-Oriented Programming Evaluation Benchm…
Establishing fair and robust benchmarks is essential for evaluating intelligent code generation by large language models (LLMs). Our survey of 35 existing benchmarks uncovers three major imbalances: 85.7% focus on a single programming…
Code generation benchmarks such as HumanEval are widely adopted to evaluate LLMs' capabilities. However, after consolidating the latest 24 benchmarks, we noticed three significant imbalances. First, imbalanced programming language. 95.8% of…
Recent advances in large language models (LLMs) have driven extensive evaluations in software engineering. however, most prior work concentrates on code-level tasks, leaving software design capabilities underexplored. To fill this gap, we…
We find ourselves in the midst of an explosion in artificial intelligence research, particularly with large language models (LLMs). These models have diverse applications spanning finance, commonsense knowledge graphs, medicine, and visual…
Large Language Models (LLMs) have emerged as promising tools to assist students while solving programming assignments. However, object-oriented programming (OOP), with its inherent complexity involving the identification of entities,…
Recent advancements in large language models (LLMs) have greatly improved code generation, specifically at the function level. For instance, GPT-4o has achieved a 91.0\% pass rate on HumanEval. However, this draws into question the adequacy…
In the area of code generation research, the emphasis has transitioned from crafting individual functions to developing class-level method code that integrates contextual information. This shift has brought several benchmarks such as…
Evaluating the programming robustness of large language models (LLMs) is paramount for ensuring their reliability in AI-based software development. However, adversarial attacks exhibit fundamental limitations that compromise fair robustness…
Object-Oriented Programming (OOP) has become a crucial paradigm for managing the growing complexity of modern software systems, particularly in fields like machine learning, deep learning, large language models (LLM), and data analytics.…
Large Language Models (LLMs) are predominantly assessed based on their common sense reasoning, language comprehension, and logical reasoning abilities. While models trained in specialized domains like mathematics or coding have demonstrated…
Growing renewable penetration introduces substantial uncertainty into power system operations, necessitating frequent adaptation of dispatch objectives and constraints and challenging expertise-intensive, near-real-time modeling workflows.…
Information Technology (IT) Operations (Ops), particularly Artificial Intelligence for IT Operations (AIOps), is the guarantee for maintaining the orderly and stable operation of existing information systems. According to Gartner's…
The rapid progress of artificial intelligence increasingly relies on efficient integrated circuit (IC) design. Recent studies have explored the use of large language models (LLMs) for generating Register Transfer Level (RTL) code, but…
Large language models (LLMs) have increasingly been applied to automatic programming code generation. This task can be viewed as a language generation task that bridges natural language, human knowledge, and programming logic. However, it…
Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with…
We introduce self-invoking code generation, a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem. They…
Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for…
Recently, a number of repository-level code generation benchmarks-such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena-have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like…
Multi-objective optimization problems (MOPs) are ubiquitous in real-world applications, presenting a complex challenge of balancing multiple conflicting objectives. Traditional evolutionary algorithms (EAs), though effective, often rely on…
The application of large language models (LLMs) in the field of coding is evolving rapidly: from code assistants, to autonomous coding agents, and then to generating complete projects through natural language. Early LLM code benchmarks…