Related papers: Code4ML: a Large-scale Dataset of annotated Machin…

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation.…

Software Engineering · Computer Science 2021-03-17 Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu

CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase…

Software Engineering · Computer Science 2021-08-31 Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss

Ecosystem of Large Language Models for Code

The availability of vast amounts of publicly accessible data of source code and the advances in modern language models, coupled with increasing computational resources, have led to a remarkable surge in the development of large language…

Software Engineering · Computer Science 2024-10-01 Zhou Yang, Jieke Shi, Premkumar Devanbu, David Lo

CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code

Motivated by recent work on lifelong learning applications for language models (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused on code changes. Our contribution addresses a notable research gap marked by the absence…

Software Engineering · Computer Science 2023-12-21 Martin Weyssow, Claudio Di Sipio, Davide Di Ruscio, Houari Sahraoui

QDataset: Quantum Datasets for Machine Learning

The availability of large-scale datasets on which to train, benchmark and test algorithms has been central to the rapid development of machine learning as a discipline and its maturity as a research discipline. Despite considerable…

Quantum Physics · Physics 2021-08-17 Elija Perrier, Akram Youssry, Chris Ferrie

CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance

One of the central tasks in software maintenance is being able to understand and develop code changes. Thus, given a natural language description of the desired new operation of a function, an agent (human or AI) might be asked to generate…

Software Engineering · Computer Science 2025-02-05 Kunal Pai, Premkumar Devanbu, Toufique Ahmed

JEMMA: An Extensible Java Dataset for ML4Code Applications

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an…

Software Engineering · Computer Science 2022-12-20 Anjan Karmakar, Miltiadis Allamanis, Romain Robbes

Constructing Multilingual Code Search Dataset Using Neural Machine Translation

Code search is a task to find programming codes that semantically match the given natural language queries. Even though some of the existing datasets for this task are multilingual on the programming language side, their query data are only…

Computation and Language · Computer Science 2023-06-28 Ryo Sekizawa, Nan Duan, Shuai Lu, Hitomi Yanaka

StatLLM: A Dataset for Evaluating the Performance of Large Language Models in Statistical Analysis

The coding capabilities of large language models (LLMs) have opened up new opportunities for automatic statistical analysis in machine learning and data science. However, before their widespread adoption, it is crucial to assess the…

Applications · Statistics 2025-02-26 Xinyi Song, Lina Lee, Kexin Xie, Xueying Liu, Xinwei Deng, Yili Hong

Integrating Code Metrics into Automated Documentation Generation for Computational Notebooks

Effective code documentation is essential for collaboration, comprehension, and long-term software maintainability, yet developers often neglect it due to its repetitive nature. Automated documentation generation has evolved from heuristic…

Software Engineering · Computer Science 2026-02-10 Mojtaba Mostafavi Ghahfarokhi, Hamed Jahantigh, Alireza Asadi, Abbas Heydarnoori

Challenges and Barriers of Using Low Code Software for Machine Learning

As big data grows ubiquitous across many domains, more and more stakeholders seek to develop Machine Learning (ML) applications on their data. The success of an ML application usually depends on the close collaboration of ML experts and…

Software Engineering · Computer Science 2022-11-10 Md Abdullah Al Alamin, Gias Uddin

Grounding Data Science Code Generation with Input-Output Specifications

Large language models (LLMs) have recently demonstrated a remarkable ability to generate code from natural language (NL) prompts. However, in the real world, NL is often too ambiguous to capture the true intent behind programming problems,…

Machine Learning · Computer Science 2024-03-18 Yeming Wen, Pengcheng Yin, Kensen Shi, Henryk Michalewski, Swarat Chaudhuri, Alex Polozov

Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources…

Machine Learning · Computer Science 2025-10-28 Amal Abed, Ivan Lukic, Jörg K. H. Franke, Frank Hutter

COFO: COdeFOrces dataset for Program Classification, Recognition and Tagging

In recent years, a lot of technological advances in computer science have aided software programmers to create innovative and real-time user-friendly software. With the creation of the software and the urging interest of people to learn to…

Software Engineering · Computer Science 2025-03-25 Kuldeep Gautam, S. VenkataKeerthy, Ramakrishna Upadrasta

Pitfalls in Language Models for Code Intelligence: A Taxonomy and Survey

Modern language models (LMs) have been successfully employed in source code generation and understanding, leading to a significant increase in research focused on learning-based code intelligence, such as automated bug repair, and test case…

Software Engineering · Computer Science 2023-10-30 Xinyu She, Yue Liu, Yanjie Zhao, Yiling He, Li Li, Chakkrit Tantithamthavorn, Zhan Qin, Haoyu Wang

LLM4DS: Evaluating Large Language Models for Data Science Code Generation

The adoption of Large Language Models (LLMs) for code generation in data science offers substantial potential for enhancing tasks such as data manipulation, statistical analysis, and visualization. However, the effectiveness of these models…

Software Engineering · Computer Science 2024-11-20 Nathalia Nascimento, Everton Guimaraes, Sai Sanjna Chintakunta, Santhosh Anitha Boominathan

CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation

The generation of large, high-quality datasets for code understanding and generation remains a significant challenge, particularly when aligning decompiled binaries with their original source code. To address this, we present CodableLLM, a…

Software Engineering · Computer Science 2025-07-31 Dylan Manuel, Paul Rad

Search Based Code Generation for Machine Learning Programs

Machine Learning (ML) has revamped every domain of life as it provides powerful tools to build complex systems that learn and improve from experience and data. Our key insight is that to solve a machine learning problem, data scientists do…

Software Engineering · Computer Science 2018-02-07 Muhammad Zubair Malik, Muhammad Nawaz, Nimrah Mustafa, Junaid Haroon Siddiqui

NetML: A Challenge for Network Traffic Analytics

Classifying network traffic is the basis for important network applications. Prior research in this area has faced challenges on the availability of representative datasets, and many of the results cannot be readily reproduced. Such a…

Cryptography and Security · Computer Science 2020-04-29 Onur Barut, Yan Luo, Tong Zhang, Weigang Li, Peilong Li

On the use of LLMs to generate a dataset of Neural Networks

Neural networks are increasingly used to support decision-making. To verify their reliability and adaptability, researchers and practitioners have proposed a variety of tools and methods for tasks such as NN code verification, refactoring,…

Machine Learning · Computer Science 2026-02-05 Nadia Daoudi, Jordi Cabot