Related papers: Tests as Prompt: A Test-Driven-Development Benchma…
This paper presents insights from evaluating 16 frontier large language models (LLMs) on the WebApp1K benchmark, a test suite designed to assess the ability of LLMs to generate web application code. The results reveal that while all models…
We introduce WebApp1K, a practical code-generation benchmark to measure LLM ability to develop web apps. This benchmark aims to calibrate LLM output and aid the models to progressively improve code correctness and functionality. The…
Recent Large Language Models (LLMs) have demonstrated significant capabilities in generating code snippets directly from problem statements. This increasingly automated process mirrors traditional human-led software development, where code…
In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related…
Large Language Models (LLMs) have emerged as coding assistants, capable of generating source code from natural language prompts. With the increasing adoption of LLMs in software development, academic research and industry based projects are…
Software testing is a crucial phase in the software life cycle, helping identify potential risks and reduce maintenance costs. With the advancement of Large Language Models (LLMs), researchers have proposed an increasing number of LLM-based…
The widespread adoption of web applications has made their security a critical concern and has increased the need for systematic ways to assess whether they can be considered trustworthy. However, "trust" assessment remains an open problem…
Large Language Models (LLMs) have made significant strides in front-end code generation. However, existing benchmarks exhibit several critical limitations: many tasks are overly simplistic, test cases often lack rigor, and end-to-end…
Large Language Models (LLMs), such as ChatGPT, are increasingly leveraged for generating both traditional software code and spreadsheet logic. Despite their impressive generative capabilities, these models frequently exhibit critical issues…
Large Language Models (LLMs) are nowadays extensively used for various types of software engineering tasks, primarily code generation. Previous research has shown how suitable prompt engineering could help developers in improving their code…
Web applications (web apps) have become a key arena for large language models (LLMs) to demonstrate their code generation capabilities and commercial potential. However, building a benchmark for LLM-generated web apps remains challenging…
Context: Due to the demand for strong algorithmic reasoning, complex logic implementation, and strict adherence to input/output formats and resource constraints, competitive programming generation by large language models (LLMs) is…
As large language models become increasingly capable of generating code, evaluating their performance remains a complex and evolving challenge. Existing benchmarks primarily focus on functional correctness, overlooking the diversity of…
Test-driven development (TDD) has been adopted to improve Large Language Model (LLM)-based code generation by using tests as executable specifications. However, existing TDD-style code generation studies are largely limited to…
Testing plays a crucial role in the software development cycle, enabling the detection of bugs, vulnerabilities, and other undesirable behaviors. To perform software testing, testers need to write code snippets that execute the program…
The capabilities of Large Language Models (LLMs) in code generation have been extensively studied, particularly for implementing target functionalities from natural-language descriptions. Alternatively, input-output (I/O) examples provide…
Large Language Models (LLMs) excel in code-related tasks like code generation, but benchmark evaluations often overlook task characteristics, such as difficulty. Moreover, benchmarks are usually built using tasks described with a single…
Large language models have shown impressive performance in various domains, including code generation across diverse open-source domains. However, their applicability in proprietary industrial settings, where domain-specific constraints and…
This study investigates the reliability of code generation by Large Language Models (LLMs), focusing on identifying and analyzing defects in the generated code. Despite the advanced capabilities of LLMs in automating code generation,…
Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have…