Related papers: SEART Data Hub: Streamlining Large-Scale Source Co…

Summarising Big Data: Common GitHub Dataset for Software Engineering Challenges

In open-source software development environments; textual, numerical and relationship-based data generated are of interest to researchers. Various data sets are available for this data, which is frequently used in areas such as software…

Software Engineering · Computer Science 2020-10-01 Abdulkadir Şeker, Banu Diri, Halil Arslan

Seed-Coder: Let the Code Model Curate Data for Itself

Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code…

Computation and Language · Computer Science 2025-06-06 ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu

Rethinking Dataset Discovery with DataScout

Dataset Search -- the process of finding appropriate datasets for a given task -- remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is…

Human-Computer Interaction · Computer Science 2025-07-28 Rachel Lin, Bhavya Chopra, Wenjing Lin, Shreya Shankar, Madelon Hulsebos, Aditya G. Parameswaran

On Inter-dataset Code Duplication and Data Leakage in Large Language Models

Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets…

Software Engineering · Computer Science 2024-08-02 José Antonio Hernández López, Boqi Chen, Mootez Saad, Tushar Sharma, Dániel Varró

Deep Learning Meets Software Engineering: A Survey on Pre-Trained Models of Source Code

Recent years have seen the successful application of deep learning to software engineering (SE). In particular, the development and use of pre-trained models of source code has enabled state-of-the-art results to be achieved on a wide…

Software Engineering · Computer Science 2022-05-25 Changan Niu, Chuanyi Li, Bin Luo, Vincent Ng

SECRET: Towards Scalable and Efficient Code Retrieval via Segmented Deep Hashing

Code retrieval, which retrieves code snippets based on users' natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval…

Software Engineering · Computer Science 2024-12-17 Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael R. Lyu

Learning and Evaluating Contextual Embedding of Source Code

Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come…

Software Engineering · Computer Science 2020-08-19 Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

Processing Large Datasets of Fined Grained Source Code Changes

In the era of Big Code, when researchers seek to study an increasingly large number of repositories to support their findings, the data processing stage may require manipulating millions and more of records. In this work we focus on studies…

Software Engineering · Computer Science 2019-10-22 Stanislav Levin, Amiram Yehudai

PTMTorrent: A Dataset for Mining Open-source Pre-trained Model Packages

Due to the cost of developing and training deep learning models from scratch, machine learning engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks. PTM registries known as "model hubs" support…

Software Engineering · Computer Science 2023-03-17 Wenxin Jiang, Nicholas Synovic, Purvish Jajal, Taylor R. Schorlemmer, Arav Tewari, Bhavesh Pareek, George K. Thiruvathukal, James C. Davis

Data-Driven Search-based Software Engineering

This paper introduces Data-Driven Search-based Software Engineering (DSE), which combines insights from Mining Software Repositories (MSR) and Search-based Software Engineering (SBSE). While MSR formulates software engineering problems as…

Software Engineering · Computer Science 2020-08-31 Vivek Nair, Amritanshu Agrawal, Jianfeng Chen, Wei Fu, George Mathew, Tim Menzies, Leandro Minku, Markus Wagner, Zhe Yu

Source Code Data Augmentation for Deep Learning: A Survey

The increasingly popular adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and…

Computation and Language · Computer Science 2023-11-14 Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, David Lo

Integrating Large Language Models in Software Engineering Education: A Pilot Study through GitHub Repositories Mining

Context: Large Language Models (LLMs) such as ChatGPT are increasingly adopted in software engineering (SE) education, offering both opportunities and challenges. Their adoption requires systematic investigation to ensure responsible…

Software Engineering · Computer Science 2025-09-08 Maryam Khan, Muhammad Azeem Akbar, Jussi Kasurinen

Sampling Projects in GitHub for MSR Studies

Almost every Mining Software Repositories (MSR) study requires, as first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria…

Software Engineering · Computer Science 2021-03-09 Ozren Dabic, Emad Aghajani, Gabriele Bavota

API-guided Dataset Synthesis to Finetune Large Code Models

Large code models (LCMs), pre-trained on vast code corpora, have demonstrated remarkable performance across a wide array of code-related tasks. Supervised fine-tuning (SFT) plays a vital role in aligning these models with specific…

Software Engineering · Computer Science 2024-08-23 Zongjie Li, Daoyuan Wu, Shuai Wang, Zhendong Su

LeetCodeDataset: A Temporal Dataset for Robust Evaluation and Efficient Training of Code LLMs

We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds.…

Machine Learning · Computer Science 2025-04-22 Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu

CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase…

Software Engineering · Computer Science 2021-08-31 Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, Shyam Ramji, Ulrich Finkler, Susan Malaika, Frederick Reiss

Making the most of small Software Engineering datasets with modern machine learning

This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software…

Software Engineering · Computer Science 2021-06-30 Julian Aron Prenner, Romain Robbes

SEER: Enhancing Chain-of-Thought Code Generation through Self-Exploring Deep Reasoning

Code generation, the task of creating executable programs from natural language requirements, has recently seen tremendous advances through Chain-of-Thought (CoT) reasoning, which enables Large Language Models (LLMs) to develop high-level…

Software Engineering · Computer Science 2025-10-21 Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Michael R. Lyu

GIRT-Data: Sampling GitHub Issue Report Templates

GitHub's issue reports provide developers with valuable information that is essential to the evolution of a software development project. Contributors can use these reports to perform software engineering tasks like submitting bugs,…

Software Engineering · Computer Science 2023-03-22 Nafiseh Nikeghbal, Amir Hossein Kargaran, Abbas Heydarnoori, Hinrich Schütze

RevMine: An LLM-Assisted Tool for Code Review Mining and Analysis Across Git Platforms

Empirical research on code review processes is increasingly central to understanding software quality and collaboration. However, collecting and analyzing review data remains a time-consuming and technically intensive task. Most researchers…

Software Engineering · Computer Science 2025-10-07 Samah Kansab, Francis Bordeleau, Ali Tizghadam