Related papers: Summarising Big Data: Common GitHub Dataset for So…

Open Source Software Development Challenges: A Systematic Literature Review on GitHub

Git is used as the distributed version control system for many open-source software projects. One Git-based service, GitHub, is the most common code hosting and repository service for open-source software projects. For researchers that…

Software Engineering · Computer Science 2021-01-22 Abdulkadir Şeker, Banu Diri, Halil Arslan, Mehmet Fatih Amasyalı

Open Data on GitHub: Unlocking the Potential of AI

GitHub is the world's largest platform for collaborative software development, with over 100 million users. GitHub is also used extensively for open data collaboration, hosting more than 800 million open data files, totaling 142 terabytes…

Machine Learning · Computer Science 2023-06-13 Anthony Cintron Roman, Kevin Xu, Arfon Smith, Jehu Torres Vega, Caleb Robinson, Juan M Lavista Ferres

Data Engineering for Everyone

Data engineering is one of the fastest-growing fields within machine learning (ML). As ML becomes more common, the appetite for data grows more ravenous. But ML requires more data than individual teams of data engineers can readily produce,…

Machine Learning · Computer Science 2021-02-24 Vijay Janapa Reddi, Greg Diamos, Pete Warden, Peter Mattson, David Kanter

DataHub: Collaborative Data Science & Dataset Version Management at Scale

Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving…

Databases · Computer Science 2014-09-03 Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, Aditya G. Parameswaran

Editorial: Special Issue on Collaborative Aspects of Open Data in Software Engineering

High-quality data has become increasingly important to software engineers in designing and implementing today's software, for example, as an input to machine-learning algorithms and visualisation- and analytics-based features. Open data -…

Software Engineering · Computer Science 2022-08-02 Johan Linåker, Per Runeson, Anneke Zuiderwijk, Amanda Brock

SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing

Large-scale code datasets have acquired an increasingly central role in software engineering (SE) research. This is the result of (i) the success of the mining software repositories (MSR) community, that pushed the standards of empirical…

Software Engineering · Computer Science 2024-09-30 Ozren Dabić, Rosalia Tufano, Gabriele Bavota

Public Git Archive: a Big Code dataset for all

The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research…

Software Engineering · Computer Science 2018-03-28 Vadim Markovtsev, Waren Long

OpenDORS: A dataset of openly referenced open research software

In many academic disciplines, software is created during the research process or for a research purpose. The crucial role of software for research is increasingly acknowledged. The application of software engineering to research software…

Software Engineering · Computer Science 2025-12-02 Stephan Druskat, Lars Grunske

More Effective Software Repository Mining

Background: Data mining and analyzing of public Git software repositories is a growing research field. The tools used for studies that investigate a single project or a group of projects have been refined, but it is not clear whether the…

Software Engineering · Computer Science 2020-08-18 Adam Tutko, Austin Henley, Audris Mockus

Datasheets for Datasets

The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics…

Databases · Computer Science 2021-12-03 Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé, Kate Crawford

Open Data: Reverse Engineering and Maintenance Perspective

Open data is an emerging paradigm to share large and diverse datasets -- primarily from governmental agencies, but also from other organizations -- with the goal to enable the exploitation of the data for societal, academic, and commercial…

Software Engineering · Computer Science 2012-02-09 Holger M. Kienle

How are Software Repositories Mined? A Systematic Literature Review of Workflows, Methodologies, Reproducibility, and Tools

With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data that is…

Software Engineering · Computer Science 2022-04-19 Adam Tutko, Austin Z. Henley, Audris Mockus

Data Combination for Problem-solving: A Case of an Open Data Exchange Platform

In recent years, rather than enclosing data within a single organization, exchanging and combining data from different domains has become an emerging practice. Many studies have discussed the economic and utility value of data and data…

Computers and Society · Computer Science 2020-12-23 Teruaki Hayashi, Hiroki Sakaji, Hiroyasu Matsushima, Yoshiaki Fukami, Takumi Shimizu, Yukio Ohsawa

Big Data = Big Insights? Operationalising Brooks' Law in a Massive GitHub Data Set

Massive data from software repositories and collaboration tools are widely used to study social aspects in software development. One question that several recent works have addressed is how a software project's size and structure influence…

Software Engineering · Computer Science 2022-01-13 Christoph Gote, Pavlin Mavrodiev, Frank Schweitzer, Ingo Scholtes

Dataset search: a survey

Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts…

Databases · Computer Science 2022-11-10 Adriane Chapman, Elena Simperl, Laura Koesten, George Konstantinidis, Luis-Daniel Ibáñez-Gonzalez, Emilia Kacprzak, Paul Groth

Accessibility Barriers in Multi-Terabyte Public Datasets: The Gap Between Promise and Practice

The promise of "free and open" multi-terabyte datasets often collides with harsh realities. While these datasets may be technically accessible, practical barriers -- from processing complexity to hidden costs -- create a system that…

Computers and Society · Computer Science 2025-06-17 Marc Bara

An Experimental Survey on Big Data Frameworks

Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-07 Wissem Inoubli, Sabeur Aridhi, Haithem Mezni, Mondher Maddouri, Engelbert Mephu Nguifo

A Versatile Dataset of Agile Open Source Software Projects

Agile software development is nowadays a widely adopted practise in both open-source and industrial software projects. Agile teams typically heavily rely on issue management tools to document new issues and keep track of outstanding ones,…

Software Engineering · Computer Science 2022-02-03 Vali Tawosi, Afnan Al-Subaihin, Rebecca Moussa, Federica Sarro

WikiDataSets: Standardized sub-graphs from Wikidata

Developing new ideas and algorithms in the fields of graph processing and relational learning requires public datasets. While Wikidata is the largest open source knowledge graph, involving more than fifty million entities, it is larger than…

Machine Learning · Computer Science 2019-10-07 Armand Boschin, Thomas Bonald

A Methodology for Using GitLab for Software Engineering Learning Analytics

To bridge the digital skills gap, we need to train more people in Software Engineering techniques. This paper reports on a project exploring the way students solve tasks using collaborative development platforms and version control systems,…

Software Engineering · Computer Science 2019-03-19 Julio César Cortés Ríos, Kamilla Kopec-Harding, Sukru Eraslan, Christopher Page, Robert Haines, Caroline Jay, Suzanne M. Embury