Related papers: Summarising Big Data: Common GitHub Dataset for So…

200 papers

Git is used as the distributed version control system for many open-source software projects. One Git-based service, GitHub, is the most common code hosting and repository service for open-source software projects. For researchers that…

Software Engineering · Computer Science 2021-01-22 Abdulkadir Şeker, Banu Diri, Halil Arslan, Mehmet Fatih Amasyalı

GitHub is the world's largest platform for collaborative software development, with over 100 million users. GitHub is also used extensively for open data collaboration, hosting more than 800 million open data files, totaling 142 terabytes…

Machine Learning · Computer Science 2023-06-13 Anthony Cintron Roman, Kevin Xu, Arfon Smith, Jehu Torres Vega, Caleb Robinson, Juan M Lavista Ferres

Data engineering is one of the fastest-growing fields within machine learning (ML). As ML becomes more common, the appetite for data grows more ravenous. But ML requires more data than individual teams of data engineers can readily produce,…

Machine Learning · Computer Science 2021-02-24 Vijay Janapa Reddi, Greg Diamos, Pete Warden, Peter Mattson, David Kanter

Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving…

High-quality data has become increasingly important to software engineers in designing and implementing today's software, for example, as an input to machine-learning algorithms and visualisation- and analytics-based features. Open data -…

Software Engineering · Computer Science 2022-08-02 Johan Linåker, Per Runeson, Anneke Zuiderwijk, Amanda Brock

Large-scale code datasets have acquired an increasingly central role in software engineering (SE) research. This is the result of (i) the success of the mining software repositories (MSR) community, that pushed the standards of empirical…

Software Engineering · Computer Science 2024-09-30 Ozren Dabić, Rosalia Tufano, Gabriele Bavota

The number of open source software projects has been growing exponentially. The major online software repository host, GitHub, has accumulated tens of millions of publicly available Git version-controlled repositories. Although the research…

Software Engineering · Computer Science 2018-03-28 Vadim Markovtsev, Waren Long

In many academic disciplines, software is created during the research process or for a research purpose. The crucial role of software for research is increasingly acknowledged. The application of software engineering to research software…

Software Engineering · Computer Science 2025-12-02 Stephan Druskat, Lars Grunske

Background: Data mining and analyzing of public Git software repositories is a growing research field. The tools used for studies that investigate a single project or a group of projects have been refined, but it is not clear whether the…

Software Engineering · Computer Science 2020-08-18 Adam Tutko, Austin Henley, Audris Mockus

The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics…

Open data is an emerging paradigm to share large and diverse datasets -- primarily from governmental agencies, but also from other organizations -- with the goal to enable the exploitation of the data for societal, academic, and commercial…

Software Engineering · Computer Science 2012-02-09 Holger M. Kienle

With the advent of open source software, a veritable treasure trove of previously proprietary software development data was made available. This opened the field of empirical software engineering research to anyone in academia. Data that is…

Software Engineering · Computer Science 2022-04-19 Adam Tutko, Austin Z. Henley, Audris Mockus

In recent years, rather than enclosing data within a single organization, exchanging and combining data from different domains has become an emerging practice. Many studies have discussed the economic and utility value of data and data…

Computers and Society · Computer Science 2020-12-23 Teruaki Hayashi, Hiroki Sakaji, Hiroyasu Matsushima, Yoshiaki Fukami, Takumi Shimizu, Yukio Ohsawa

Massive data from software repositories and collaboration tools are widely used to study social aspects in software development. One question that several recent works have addressed is how a software project's size and structure influence…

Software Engineering · Computer Science 2022-01-13 Christoph Gote, Pavlin Mavrodiev, Frank Schweitzer, Ingo Scholtes

Generating value from data requires the ability to find, access and make sense of datasets. There are many efforts underway to encourage data sharing and reuse, from scientific publishers asking authors to submit data alongside manuscripts…

The promise of "free and open" multi-terabyte datasets often collides with harsh realities. While these datasets may be technically accessible, practical barriers -- from processing complexity to hidden costs -- create a system that…

Computers and Society · Computer Science 2025-06-17 Marc Bara

Recently, increasingly large amounts of data are generated from a variety of sources. Existing data processing technologies are not suitable to cope with the huge amounts of generated data. Yet, many research works focus on Big Data, a…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-06-07 Wissem Inoubli, Sabeur Aridhi, Haithem Mezni, Mondher Maddouri, Engelbert Mephu Nguifo

Agile software development is nowadays a widely adopted practice in both open-source and industrial software projects. Agile teams typically rely heavily on issue management tools to document new issues and keep track of outstanding ones,…

Software Engineering · Computer Science 2022-02-03 Vali Tawosi, Afnan Al-Subaihin, Rebecca Moussa, Federica Sarro

Developing new ideas and algorithms in the fields of graph processing and relational learning requires public datasets. While Wikidata is the largest open source knowledge graph, involving more than fifty million entities, it is larger than…

Machine Learning · Computer Science 2019-10-07 Armand Boschin, Thomas Bonald

To bridge the digital skills gap, we need to train more people in Software Engineering techniques. This paper reports on a project exploring the way students solve tasks using collaborative development platforms and version control systems,…
