Related papers: LogicPro: Improving Complex Logical Reasoning via …

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate…

Artificial Intelligence · Computer Science 2026-05-11 Yongxian Wei, Yilin Zhao, Zixuan Hu, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Chun Yuan, Dian Li

Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources…

Machine Learning · Computer Science 2025-10-28 Amal Abed, Ivan Lukic, Jörg K. H. Franke, Frank Hutter

MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

Large language models have achieved substantial progress in mathematical reasoning, yet their advancement is limited by the scarcity of high-quality, high-difficulty training data. Existing synthesis methods largely rely on transforming…

Computation and Language · Computer Science 2026-03-10 Shaoxiong Zhan, Yanlin Lai, Ziyu Lu, Dahua Lin, Ziqing Yang, Fei Tan

Synthesis by Design: Controlled Data Generation via Structural Guidance

Mathematical reasoning remains challenging for LLMs due to complex logic and the need for precise computation. Existing methods enhance LLM reasoning by synthesizing datasets through problem rephrasing, but face issues with generation…

Computation and Language · Computer Science 2025-06-12 Lei Xu, Sirui Chen, Yuxuan Huang, Chaochao Lu

Learning from Reasoning Failures via Synthetic Data Generation

Training models on synthetic data has emerged as an increasingly important strategy for improving the performance of generative AI. This approach is particularly helpful for large multimodal models (LMMs) due to the relative scarcity of…

Artificial Intelligence · Computer Science 2026-01-13 Gabriela Ben Melech Stan, Estelle Aflalo, Avinash Madasu, Vasudev Lal, Phillip Howard

LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models

Joint logical-numerical reasoning remains a major challenge for language models, yet existing datasets rely on fixed rule sets and offer limited control over task complexity, constraining their generalizability for evaluation and training.…

Computation and Language · Computer Science 2025-10-14 Yiwei Liu, Yucheng Li, Xiao Li, Gong Cheng

PromptCoT: Synthesizing Olympiad-level Problems for Mathematical Reasoning in Large Language Models

The ability of large language models to solve complex mathematical problems has progressed significantly, particularly for tasks requiring advanced reasoning. However, the scarcity of sufficiently challenging problems, particularly at the…

Computation and Language · Computer Science 2025-12-23 Xueliang Zhao, Wei Wu, Jian Guan, Lingpeng Kong

Arrows of Math Reasoning Data Synthesis for Large Language Models: Diversity, Complexity and Correctness

Enhancing the mathematical reasoning of large language models (LLMs) demands high-quality training data, yet conventional methods face critical challenges in scalability, cost, and data reliability. To address these limitations, we propose…

Computation and Language · Computer Science 2025-08-27 Sirui Chen, Changxin Tian, Binbin Hu, Kunlong Chen, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

MathMixup: Boosting LLM Mathematical Reasoning with Difficulty-Controllable Data Synthesis and Curriculum Learning

In mathematical reasoning tasks, the advancement of Large Language Models (LLMs) relies heavily on high-quality training data with clearly defined and well-graded difficulty levels. However, existing data synthesis methods often suffer from…

Machine Learning · Computer Science 2026-01-27 Xuchen Li, Jing Chen, Xuzhao Li, Hao Liang, Xiaohuan Zhou, Taifeng Wang, Wentao Zhang

Agentic Proposing: Enhancing Large Language Model Reasoning via Compositional Skill Synthesis

Advancing complex reasoning in large language models relies on high-quality, verifiable datasets, yet human annotation remains cost-prohibitive and difficult to scale. Current synthesis paradigms often face a recurring trade-off:…

Artificial Intelligence · Computer Science 2026-02-04 Zhengbo Jiao, Shaobo Wang, Zifan Zhang, Xuan Ren, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang

Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

Large language models (LLMs) make remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive…

Computation and Language · Computer Science 2025-10-17 Kedi Chen, Zhikai Lei, Xu Guo, Xuecheng Wu, Siyuan Zeng, Jianghao Yin, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang

Unleashing LLM Reasoning Capability via Scalable Question Synthesis from Scratch

Improving the mathematical reasoning capabilities of Large Language Models (LLMs) is critical for advancing artificial intelligence. However, access to extensive, diverse, and high-quality reasoning datasets remains a significant challenge,…

Computation and Language · Computer Science 2025-05-28 Yuyang Ding, Xinyu Shi, Xiaobo Liang, Juntao Li, Zhaopeng Tu, Qiaoming Zhu, Min Zhang

Code-Driven Inductive Synthesis: Enhancing Reasoning Abilities of Large Language Models with Sequences

Large language models make remarkable progress in reasoning capabilities. Existing works focus mainly on deductive reasoning tasks (e.g., code and math), while another type of reasoning mode that better aligns with human learning, inductive…

Computation and Language · Computer Science 2025-03-18 Kedi Chen, Zhikai Lei, Fan Zhang, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang

Grammar Filtering For Syntax-Guided Synthesis

Programming-by-example (PBE) is a synthesis paradigm that allows users to generate functions by simply providing input-output examples. While a promising interaction paradigm, synthesis is still too slow for realtime interaction and more…

Machine Learning · Computer Science 2020-02-10 Kairo Morton, William Hallahan, Elven Shum, Ruzica Piskac, Mark Santolucito

MindLoom: Composing Thought Modes for Frontier-Level Reasoning Data Synthesis

Although LLMs have made substantial progress in reasoning, systematically producing frontier-level reasoning data remains difficult. Existing synthesis methods often have limited visibility into the structural factors that govern problem…

Artificial Intelligence · Computer Science 2026-05-22 Haiyang Shen, Taian Guo, Xuanzhong Chen, Mugeng Liu, Weichen Bi, Wenchun Jing, Sixiong Xie, Zhuofan Shi, Yudong Han, Chongyang Pan, Siqi Zhong, Jinsheng Huang, Ming Zhang, Yun Ma

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled…

Computation and Language · Computer Science 2026-04-16 Joel Niklaus, Atsuki Yamaguchi, Michal Štefánik, Guilherme Penedo, Hynek Kydlíček, Elie Bakouch, Lewis Tunstall, Edward Emanuel Beeching, Thibaud Frere, Colin Raffel, Leandro von Werra, Thomas Wolf

LeanReasoner: Boosting Complex Logical Reasoning with Lean

Large language models (LLMs) often struggle with complex logical reasoning due to logical inconsistencies and the inherent difficulty of such reasoning. We use Lean, a theorem proving framework, to address these challenges. By formalizing…

Computation and Language · Computer Science 2024-03-21 Dongwei Jiang, Marcio Fonseca, Shay B. Cohen

Learning from Synthetic Data Improves Multi-hop Reasoning

Reinforcement Learning (RL) has been shown to significantly boost reasoning capabilities of large language models (LLMs) in math, coding, and multi-hop reasoning tasks. However, RL fine-tuning requires abundant high-quality verifiable data,…

Machine Learning · Computer Science 2026-03-03 Anmol Kabra, Yilun Yin, Albert Gong, Kamilė Stankevičiūtė, Dongyoung Go, Johann Lee, Katie Z. Luo, Carla P. Gomes, Kilian Q. Weinberger

Scaling Synthetic Logical Reasoning Datasets with Context-Sensitive Declarative Grammars

Logical reasoning remains a challenge for natural language processing, but it can be improved by training language models to mimic theorem provers on procedurally generated problems. Previous work used domain-specific proof generation…

Computation and Language · Computer Science 2024-06-18 Damien Sileo

DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data

Proof assistants like Lean have revolutionized mathematical proof verification, ensuring high accuracy and reliability. Although large language models (LLMs) show promise in mathematical reasoning, their advancement in formal theorem…

Artificial Intelligence · Computer Science 2024-05-24 Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, Xiaodan Liang