Related papers: SEART Data Hub: Streamlining Large-Scale Source Co…
In open-source software development environments; textual, numerical and relationship-based data generated are of interest to researchers. Various data sets are available for this data, which is frequently used in areas such as software…
Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code…
Dataset Search -- the process of finding appropriate datasets for a given task -- remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is…
Motivation. Large language models (LLMs) have exhibited remarkable proficiency in diverse software engineering (SE) tasks. Handling such tasks typically involves acquiring foundational coding knowledge on large, general-purpose datasets…
Recent years have seen the successful application of deep learning to software engineering (SE). In particular, the development and use of pre-trained models of source code has enabled state-of-the-art results to be achieved on a wide…
Code retrieval, which retrieves code snippets based on users' natural language descriptions, is widely used by developers and plays a pivotal role in real-world software development. The advent of deep learning has shifted the retrieval…
Recent research has achieved impressive results on understanding and improving source code by building up on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come…
In the era of Big Code, when researchers seek to study an increasingly large number of repositories to support their findings, the data processing stage may require manipulating millions and more of records. In this work we focus on studies…
Due to the cost of developing and training deep learning models from scratch, machine learning engineers have begun to reuse pre-trained models (PTMs) and fine-tune them for downstream tasks. PTM registries known as "model hubs" support…
This paper introduces Data-Driven Search-based Software Engineering (DSE), which combines insights from Mining Software Repositories (MSR) and Search-based Software Engineering (SBSE). While MSR formulates software engineering problems as…
The increasingly popular adoption of deep learning models in many critical source code tasks motivates the development of data augmentation (DA) techniques to enhance training data and improve various capabilities (e.g., robustness and…
Context: Large Language Models (LLMs) such as ChatGPT are increasingly adopted in software engineering (SE) education, offering both opportunities and challenges. Their adoption requires systematic investigation to ensure responsible…
Almost every Mining Software Repositories (MSR) study requires, as first step, the selection of the subject software repositories. These repositories are usually collected from hosting services like GitHub using specific selection criteria…
Large code models (LCMs), pre-trained on vast code corpora, have demonstrated remarkable performance across a wide array of code-related tasks. Supervised fine-tuning (SFT) plays a vital role in aligning these models with specific…
We introduce LeetCodeDataset, a high-quality benchmark for evaluating and training code-generation models, addressing two key challenges in LLM research: the lack of reasoning-focused coding benchmarks and self-contained training testbeds.…
Over the last several decades, software has been woven into the fabric of every aspect of our society. As software development surges and code infrastructure of enterprise applications ages, it is now more critical than ever to increase…
This paper provides a starting point for Software Engineering (SE) researchers and practitioners faced with the problem of training machine learning models on small datasets. Due to the high costs associated with labeling data, in Software…
Code generation, the task of creating executable programs from natural language requirements, has recently seen tremendous advances through Chain-of-Thought (CoT) reasoning, which enables Large Language Models (LLMs) to develop high-level…
GitHub's issue reports provide developers with valuable information that is essential to the evolution of a software development project. Contributors can use these reports to perform software engineering tasks like submitting bugs,…
Empirical research on code review processes is increasingly central to understanding software quality and collaboration. However, collecting and analyzing review data remains a time-consuming and technically intensive task. Most researchers…