Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations
Abstract
Reproducibility is fundamental to the scientific method, yet it remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address these limitations, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, formally decoupling a task's problem from its solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing that autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.
Comments: 10 pages, 4 figures
Cite
@article{arxiv.2605.29786,
title = {Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations},
author = {Omar Benjelloun and Leonardo Martins Bianco and Isabelle Guyon and Thanh Gia Hieu Khuong and Jonathan Lebensold and Sebastian Lobentanzer and Luis Oala and Benedictus Kent Rachmat and Ihsan Ullah and Peyman Vahidi and Joaquin Vanschoren},
journal = {arXiv preprint arXiv:2605.29786},
year = {2026}
}