OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling

Artificial Intelligence 2025-02-24 v2 Machine Learning

Abstract

Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. A back-translation step is then employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. A collection of rejected pairs is then identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances that are much longer than those of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach. Our dataset is publicly available at https://github.com/AuroraLHL/OptMATH.
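
The accept/reject step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `Sample` type, and the toy problem are hypothetical, and the acceptance criterion used here (the optimal value of the forward-modeled problem must match that of the original PD within a tolerance) is an assumption, since the abstract does not state how correspondence is checked. In a real pipeline the two callables would wrap an LLM and an optimization solver; here they are small deterministic stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    problem_data: dict  # PD: a concrete instance generated from a seed formulation
    description: str    # NL: back-translated natural-language description


def rejection_sample(
    instances: List[dict],
    back_translate: Callable[[dict], str],
    forward_model_and_solve: Callable[[str], float],
    reference_solve: Callable[[dict], float],
    tol: float = 1e-6,
) -> Tuple[List[Sample], List[Sample]]:
    """Split (PD, NL) pairs into accepted and rejected lists.

    Assumed criterion: a pair is accepted when the optimum recovered from
    the NL agrees with the optimum of the original PD (relative tolerance).
    """
    accepted, rejected = [], []
    for pd in instances:
        sample = Sample(pd, back_translate(pd))
        try:
            candidate = forward_model_and_solve(sample.description)
        except Exception:
            # A forward model that fails to parse or solve counts as a rejection.
            rejected.append(sample)
            continue
        expected = reference_solve(pd)
        if abs(candidate - expected) <= tol * max(1.0, abs(expected)):
            accepted.append(sample)
        else:
            rejected.append(sample)
    return accepted, rejected


# Toy stand-ins (hypothetical): maximize c * x subject to 0 <= x <= ub.
def toy_reference_solve(pd: dict) -> float:
    return max(pd["c"], 0.0) * pd["ub"]


def toy_back_translate(pd: dict) -> str:
    return f"Maximize {pd['c']} * x with 0 <= x <= {pd['ub']}."


def toy_forward_model_and_solve(nl: str) -> float:
    words = nl.split()
    c = float(words[1])
    ub = float(words[-1].rstrip("."))
    # Deliberately naive: always sets x = ub, which is wrong when c < 0.
    return c * ub


if __name__ == "__main__":
    instances = [{"c": 2.0, "ub": 3.0}, {"c": -1.0, "ub": 3.0}]
    accepted, rejected = rejection_sample(
        instances,
        toy_back_translate,
        toy_forward_model_and_solve,
        toy_reference_solve,
    )
    print(len(accepted), "accepted;", len(rejected), "rejected")
```

In this toy run the instance with a negative coefficient is rejected because the naive forward model mis-solves it. In the framework described above, accepted pairs would go to the training set, while rejected pairs would be candidates for the further filtering that yields the benchmark.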

Cite

@article{arxiv.2502.11102,
  title  = {OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling},
  author = {Hongliang Lu and Zhonglin Xie and Yaoyu Wu and Can Ren and Yuxuan Chen and Zaiwen Wen},
  journal= {arXiv preprint arXiv:2502.11102},
  year   = {2025}
}

Comments

This paper has 36 pages, 18 figures, and two co-first authors: Hongliang Lu and Zhonglin Xie.