Related papers: SEART Data Hub: Streamlining Large-Scale Source Co…

200 papers

In open-source software development environments, the textual, numerical, and relationship-based data generated are of interest to researchers. Various datasets are available for this data, which is frequently used in areas such as software…

Software Engineering · Computer Science · 2020-10-01 · Abdulkadir Şeker, Banu Diri, Halil Arslan

Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often rely heavily on human effort to produce their code…

Dataset Search -- the process of finding appropriate datasets for a given task -- remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is…

Human-Computer Interaction · Computer Science · 2025-07-28 · Rachel Lin, Bhavya Chopra, Wenjing Lin, Shreya Shankar, Madelon Hulsebos, Aditya G. Parameswaran

Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets…

Software Engineering · Computer Science · 2024-08-02 · José Antonio Hernández López, Boqi Chen, Mootez Saaz, Tushar Sharma, Dániel Varró

Recent years have seen the successful application of deep learning to software engineering (SE). In particular, the development and use of pre-trained models of source code has enabled state-of-the-art results to be achieved on a wide…

Software Engineering · Computer Science · 2022-05-25 · Changan Niu, Chuanyi Li, Bin Luo, Vincent Ng

Code retrieval, which retrieves code snippets based on users' natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval…

Software Engineering · Computer Science · 2024-12-17 · Wenchao Gu, Ensheng Shi, Yanlin Wang, Lun Du, Shi Han, Hongyu Zhang, Dongmei Zhang, Michael R. Lyu

Recent research has achieved impressive results on understanding and improving source code by building upon machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come…

Software Engineering · Computer Science · 2020-08-19 · Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, Kensen Shi

In the era of Big Code, when researchers seek to study an increasingly large number of repositories to support their findings, the data processing stage may require manipulating millions of records or more. In this work we focus on studies…

Software Engineering · Computer Science · 2019-10-22 · Stanislav Levin, Amiram Yehudai

Due to the cost of developing and training deep learning models from scratch, machine learning engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks. PTM registries known as "model hubs" support…

This paper introduces Data-Driven Search-based Software Engineering (DSE), which combines insights from Mining Software Repositories (MSR) and Search-based Software Engineering (SBSE). While MSR formulates software engineering problems as…

Software Engineering · Computer Science · 2020-08-31 · Vivek Nair, Amritanshu Agrawal, Jianfeng Chen, Wei Fu, George Mathew, Tim Menzies, Leandro Minku, Markus Wagner, Zhe Yu

The increasingly popular adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and…

Computation and Language · Computer Science · 2023-11-14 · Terry Yue Zhuo, Zhou Yang, Zhensu Sun, Yufei Wang, Li Li, Xiaoning Du, Zhenchang Xing, David Lo

Context: Large Language Models (LLMs) such as ChatGPT are increasingly adopted in software engineering (SE) education, offering both opportunities and challenges. Their adoption requires systematic investigation to ensure responsible…

Software Engineering · Computer Science · 2025-09-08 · Maryam Khan, Muhammad Azeem Akbar, Jussi Kasurinen

Almost every Mining Software Repositories (MSR) study requires, as a first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria…

Software Engineering · Computer Science · 2021-03-09 · Ozren Dabic, Emad Aghajani, Gabriele Bavota

Large code models (LCMs), pre-trained on vast code corpora, have demonstrated remarkable performance across a wide array of code-related tasks. Supervised fine-tuning (SFT) plays a vital role in aligning these models with specific…

Software Engineering · Computer Science · 2024-08-23 · Zongjie Li, Daoyuan Wu, Shuai Wang, Zhendong Su

We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds.…

Machine Learning · Computer Science · 2025-04-22 · Yunhui Xia, Wei Shen, Yan Wang, Jason Klein Liu, Huifeng Sun, Siyue Wu, Jian Hu, Xiaolong Xu

Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase…

This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software…

Software Engineering · Computer Science · 2021-06-30 · Julian Aron Prenner, Romain Robbes

Code generation, the task of creating executable programs from natural language requirements, has recently seen tremendous advances through Chain-of-Thought (CoT) reasoning, which enables Large Language Models (LLMs) to develop high-level…

Software Engineering · Computer Science · 2025-10-21 · Shuzheng Gao, Chaozheng Wang, Cuiyun Gao, Michael R. Lyu

GitHub's issue reports provide developers with valuable information that is essential to the evolution of a software development project. Contributors can use these reports to perform software engineering tasks like submitting bugs,…

Software Engineering · Computer Science · 2023-03-22 · Nafiseh Nikeghbal, Amir Hossein Kargaran, Abbas Heydarnoori, Hinrich Schütze

Empirical research on code review processes is increasingly central to understanding software quality and collaboration. However, collecting and analyzing review data remains a time-consuming and technically intensive task. Most researchers…

Software Engineering · Computer Science · 2025-10-07 · Samah Kansab, Francis Bordeleau, Ali Tizghadam