COFO: COdeFOrces dataset for Program Classification, Recognition and Tagging

Software Engineering 2025-03-25 v1

Abstract

In recent years, technological advances in computer science have helped programmers create innovative, user-friendly, real-time software. As more software is written and more people take an interest in learning to program, a large collection of source code has become available on the web. This collection, known as Big Code, can serve as data for machine learning applications that address software engineering problems. In this paper, we present COFO, a dataset of 809 classes/problems with a total of 369K programs written in C, C++, Java, and Python, along with metadata such as code tags, problem specifications, and input-output specifications. COFO was scraped from the openly available Codeforces website using a Python scraper built on Selenium and BeautifulSoup. We envision that this dataset will be useful for machine learning tasks such as program classification/recognition, tagging, predicting program properties, and code comprehension.
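As a rough illustration of how the pieces named in the abstract could fit together, the sketch below models a problem (class) with its metadata and a program written for it, then derives training pairs for classification and tagging. The abstract does not describe the dataset's on-disk layout, so every class, field, and function name here (`Problem`, `Submission`, `problem_id`, `classification_pairs`, and so on) is an assumption made for illustration only.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Problem:
    """One class/problem with its metadata (hypothetical layout)."""
    problem_id: str            # hypothetical identifier for the problem
    specification: str         # problem specification text
    io_specification: str      # input-output specification text
    tags: list[str] = field(default_factory=list)  # code tags


@dataclass
class Submission:
    """One program written for a problem (hypothetical layout)."""
    problem_id: str
    language: str              # "C", "C++", "Java", or "Python"
    source: str                # program text


def classification_pairs(submissions: list[Submission]) -> list[tuple[str, str]]:
    """Pair each program with its problem id, the label for classification."""
    return [(s.source, s.problem_id) for s in submissions]


def tagging_pairs(
    submissions: list[Submission], problems: list[Problem]
) -> list[tuple[str, list[str]]]:
    """Pair each program with the tags of its problem, for multi-label tagging."""
    by_id = {p.problem_id: p for p in problems}
    return [
        (s.source, by_id[s.problem_id].tags)
        for s in submissions
        if s.problem_id in by_id  # skip programs whose problem is unknown
    ]


def language_counts(submissions: list[Submission]) -> Counter:
    """Count programs per programming language."""
    return Counter(s.language for s in submissions)
```

Under this assumed layout, program classification/recognition treats the problem as the label, while tagging treats the problem's tag list as a multi-label target.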

Cite

@article{arxiv.2503.18251,
  title  = {COFO: COdeFOrces dataset for Program Classification, Recognition and Tagging},
  author = {Kuldeep Gautam and S. VenkataKeerthy and Ramakrishna Upadrasta},
  journal= {arXiv preprint arXiv:2503.18251},
  year   = {2025}
}