Related papers: Codex Hacks HackerRank: Memorization Issues and a …

Evaluating Large Language Models Trained on Code

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we…

Machine Learning · Computer Science 2021-07-15 Mark Chen , Jerry Tworek , Heewoo Jun , Qiming Yuan , Henrique Ponde de Oliveira Pinto , Jared Kaplan , Harri Edwards , Yuri Burda , Nicholas Joseph , Greg Brockman , Alex Ray , Raul Puri , Gretchen Krueger , Michael Petrov , Heidy Khlaaf , Girish Sastry , Pamela Mishkin , Brooke Chan , Scott Gray , Nick Ryder , Mikhail Pavlov , Alethea Power , Lukasz Kaiser , Mohammad Bavarian , Clemens Winter , Philippe Tillet , Felipe Petroski Such , Dave Cummings , Matthias Plappert , Fotios Chantzis , Elizabeth Barnes , Ariel Herbert-Voss , William Hebgen Guss , Alex Nichol , Alex Paino , Nikolas Tezak , Jie Tang , Igor Babuschkin , Suchir Balaji , Shantanu Jain , William Saunders , Christopher Hesse , Andrew N. Carr , Jan Leike , Josh Achiam , Vedant Misra , Evan Morikawa , Alec Radford , Matthew Knight , Miles Brundage , Mira Murati , Katie Mayer , Peter Welinder , Bob McGrew , Dario Amodei , Sam McCandlish , Ilya Sutskever , Wojciech Zaremba

A Hazard Analysis Framework for Code Synthesis Large Language Models

Codex, a large language model (LLM) trained on a variety of codebases, exceeds the previous state of the art in its capacity to synthesize and generate code. Although Codex provides a plethora of benefits, models that may generate code on…

Software Engineering · Computer Science 2022-07-29 Heidy Khlaaf , Pamela Mishkin , Joshua Achiam , Gretchen Krueger , Miles Brundage

Automatic Program Repair with OpenAI's Codex: Evaluating QuixBugs

OpenAI's Codex, a GPT-3 like model trained on a large code corpus, has made headlines in and outside of academia. Given a short user-provided description, it is capable of synthesizing code snippets that are syntactically and semantically…

Software Engineering · Computer Science 2021-11-09 Julian Aron Prenner , Romain Robbes

Unveiling Memorization in Code Models

The availability of large-scale datasets, advanced architectures, and powerful computational resources has led to effective code models that automate diverse software engineering activities. The datasets usually consist of billions of…

Software Engineering · Computer Science 2024-01-15 Zhou Yang , Zhipeng Zhao , Chenyu Wang , Jieke Shi , Dongsun Kim , DongGyun Han , David Lo

Codehacks: A Dataset of Adversarial Tests for Competitive Programming Problems Obtained from Codeforces

Software is used in critical applications in our day-to-day life and it is important to ensure its correctness. One popular approach to assess correctness is to evaluate software on tests. If a test fails, it indicates a fault in the…

Software Engineering · Computer Science 2025-04-01 Max Hort , Leon Moonen

Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Yet, there is a dearth of studies on the impact of real-world complexities on code reasoning, e.g., inter- or intra-procedural dependencies, API calls,…

Software Engineering · Computer Science 2026-04-27 Changshu Liu , Alireza Ghazanfari , Yang Chen , Reyhaneh Jabbarvand

Traces of Memorisation in Large Language Models for Code

Large language models have gained significant popularity because of their ability to generate human-like text and potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on…

Cryptography and Security · Computer Science 2024-01-17 Ali Al-Kaswan , Maliheh Izadi , Arie van Deursen

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing…

Machine Learning · Computer Science 2025-07-15 Zhangchen Xu , Yang Liu , Yueqin Yin , Mingyuan Zhou , Radha Poovendran

Code Generation Tools (Almost) for Free? A Study of Few-Shot, Pre-Trained Language Models on Code

Few-shot learning with large-scale, pre-trained language models is a powerful way to answer questions about code, e.g., how to complete a given code example, or even generate code snippets from scratch. The success of these models raises…

Software Engineering · Computer Science 2022-06-14 Patrick Bareiß , Beatriz Souza , Marcelo d'Amorim , Michael Pradel

Multi-lingual Evaluation of Code Generation Models

We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X. These datasets cover over 10 programming languages and are generated using a scalable conversion framework that transpiles…

Machine Learning · Computer Science 2023-03-30 Ben Athiwaratkun , Sanjay Krishna Gouda , Zijian Wang , Xiaopeng Li , Yuchen Tian , Ming Tan , Wasi Uddin Ahmad , Shiqi Wang , Qing Sun , Mingyue Shang , Sujan Kumar Gonugondla , Hantian Ding , Varun Kumar , Nathan Fulton , Arash Farahani , Siddhartha Jain , Robert Giaquinto , Haifeng Qian , Murali Krishna Ramanathan , Ramesh Nallapati , Baishakhi Ray , Parminder Bhatia , Sudipta Sengupta , Dan Roth , Bing Xiang

Jigsaw: Large Language Models meet Program Synthesis

Large pre-trained language models such as GPT-3, Codex, and Google's language model are now capable of generating code from natural language specifications of programmer intent. We view these developments with a mixture of optimism and…

Software Engineering · Computer Science 2021-12-07 Naman Jain , Skanda Vaidyanath , Arun Iyer , Nagarajan Natarajan , Suresh Parthasarathy , Sriram Rajamani , Rahul Sharma

A Systematic Evaluation of Large Language Models of Code

Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not…

Programming Languages · Computer Science 2022-05-05 Frank F. Xu , Uri Alon , Graham Neubig , Vincent J. Hellendoorn

CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models

With the emergence of Large Language Models (LLMs), there has been a significant improvement in the programming capabilities of models, attracting growing attention from researchers. Evaluating the programming capabilities of LLMs is…

Computation and Language · Computer Science 2024-03-12 Lingyue Fu , Huacan Chai , Shuang Luo , Kounianhua Du , Weiming Zhang , Longteng Fan , Jiayi Lei , Renting Rui , Jianghao Lin , Yuchen Fang , Yifan Liu , Jingkuan Wang , Siyuan Qi , Kangning Zhang , Weinan Zhang , Yong Yu

Toward Effective Secure Code Reviews: An Empirical Study of Security-Related Coding Weaknesses

Identifying security issues early is encouraged to reduce the latent negative impacts on software systems. Code review is a widely-used method that allows developers to manually inspect modified code, catching security issues during a…

Software Engineering · Computer Science 2024-05-10 Wachiraphan Charoenwet , Patanamon Thongtanunam , Van-Thuan Pham , Christoph Treude

CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Benchmarking on HumanEval-X

Large pre-trained code generation models, such as OpenAI Codex, can generate syntax- and function-correct code, making the coding of programmers more productive and our pursuit of artificial general intelligence closer. In this paper, we…

Machine Learning · Computer Science 2024-07-11 Qinkai Zheng , Xiao Xia , Xu Zou , Yuxiao Dong , Shan Wang , Yufei Xue , Zihan Wang , Lei Shen , Andi Wang , Yang Li , Teng Su , Zhilin Yang , Jie Tang

Learned or Memorized? Quantifying Memorization Advantage in Code LLMs

The lack of transparency about code datasets used to train large language models (LLMs) makes it difficult to detect, evaluate, and mitigate data leakage. We present a perturbation-based method to quantify memorization advantage in code…

Software Engineering · Computer Science 2026-04-16 Djiré Albérick Euraste , Kaboré Abdoul Kader , Jordan Samhi , Earl T. Barr , Jacques Klein , Tegawendé F. Bissyandé

CodeARC: Benchmarking Reasoning Capabilities of LLM Agents for Inductive Program Synthesis

Inductive program synthesis, or programming by example, requires synthesizing functions from input-output examples that generalize to unseen inputs. While large language model agents have shown promise in programming tasks guided by natural…

Programming Languages · Computer Science 2025-08-11 Anjiang Wei , Tarun Suresh , Jiannan Cao , Naveen Kannan , Yuheng Wu , Kai Yan , Thiago S. F. X. Teixeira , Ke Wang , Alex Aiken

Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models

Large language models (LLMs), such as Codex, hold great promise in enhancing programming education by automatically generating feedback for students. We investigate using LLMs to generate feedback for fixing syntax errors in Python…

Programming Languages · Computer Science 2023-05-01 Tung Phung , José Cambronero , Sumit Gulwani , Tobias Kohn , Rupak Majumdar , Adish Singla , Gustavo Soares

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce…

Software Engineering · Computer Science 2024-01-29 Daya Guo , Qihao Zhu , Dejian Yang , Zhenda Xie , Kai Dong , Wentao Zhang , Guanting Chen , Xiao Bi , Y. Wu , Y. K. Li , Fuli Luo , Yingfei Xiong , Wenfeng Liang

Improving Automated Secure Code Reviews: A Synthetic Dataset for Code Vulnerability Flaws

Automation of code reviews using AI models has garnered substantial attention in the software engineering community as a strategy to reduce the cost and effort associated with traditional peer review processes. These models are typically…

Software Engineering · Computer Science 2025-04-24 Leonardo Centellas-Claros , Juan J. Alonso-Lecaros , Juan Pablo Sandoval Alcocer , Andres Neyem