OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs

Wasi Uddin Ahmad; Aleksander Ficek; Mehrzad Samadi; Jocelyn Huang; Vahid Noroozi; Somshubra Majumdar; Boris Ginsburg

Software Engineering, Computation and Language | 2025-08-11 (v2)

Abstract

Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly available supervised fine-tuning (SFT) datasets tailored for coding tasks. To bridge this gap, we introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset. Comprehensive evaluations on popular benchmarks (HumanEval, MBPP, LiveCodeBench, and BigCodeBench) demonstrate substantial performance improvements achieved by SFT with OpenCodeInstruct. We also present a detailed methodology encompassing seed data curation, synthetic instruction and solution generation, and filtering.
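To make the per-sample structure concrete, here is a minimal Python sketch of one record and a simple quality filter. It is an illustration only: the class `Sample`, its field names, the `[0, 1]` quality scale, and the thresholds in `filter_samples` are all hypothetical and are not taken from the dataset's actual schema or the authors' filtering procedure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """Hypothetical record layout; field names are illustrative, not the real schema."""
    question: str                                            # programming question
    solution: str                                            # candidate solution code
    test_cases: List[str] = field(default_factory=list)      # test cases as source strings
    test_passed: List[bool] = field(default_factory=list)    # execution feedback, one flag per test
    quality_score: float = 0.0                               # LLM-generated assessment, assumed in [0, 1]

    def pass_rate(self) -> float:
        """Fraction of tests that passed; 0.0 when no execution feedback exists."""
        if not self.test_passed:
            return 0.0
        return sum(self.test_passed) / len(self.test_passed)


def filter_samples(samples, min_pass_rate=1.0, min_quality=0.5):
    """Keep samples meeting both (assumed) thresholds, preserving input order."""
    return [
        s for s in samples
        if s.pass_rate() >= min_pass_rate and s.quality_score >= min_quality
    ]
```

For example, `filter_samples(records, min_pass_rate=0.8)` would keep only records whose solutions pass at least 80% of their tests and whose assumed quality score is at least 0.5.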

Keywords

instruction tuning; supervised fine-tuning; code generation

Cite

@article{arxiv.2504.04030,
  title  = {OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs},
  author = {Wasi Uddin Ahmad and Aleksander Ficek and Mehrzad Samadi and Jocelyn Huang and Vahid Noroozi and Somshubra Majumdar and Boris Ginsburg},
  journal= {arXiv preprint arXiv:2504.04030},
  year   = {2025}
}

Comments

Work in progress
