Related papers: Magicoder: Empowering Code Generation with OSS-Ins…

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In…

Computation and Language · Computer Science 2025-05-28 Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang

Data-efficient LLM Fine-tuning for Code Generation

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically…

Computation and Language · Computer Science 2025-04-18 Weijie Lv, Xuan Xia, Sheng-Jun Huang

SCoder: Iterative Self-Distillation for Bootstrapping Small-Scale Data Synthesizers to Empower Code LLMs

Existing code large language models (LLMs) often rely on large-scale instruction data distilled from proprietary LLMs for fine-tuning, which typically incurs high costs. In this paper, we explore the potential of small-scale open-source…

Artificial Intelligence · Computer Science 2025-09-10 Xinyu Zhang, Changzhi Zhou, Linmei Hu, Luhao Zhang, Xiancai Chen, Haomin Fu, Yang Yang, Mengdi Zhang

WizardLM: Empowering large pre-trained language models to follow complex instructions

Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce…

Computation and Language · Computer Science 2025-05-28 Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, Daxin Jiang

AlchemistCoder: Harmonizing and Eliciting Code Capability by Hindsight Tuning on Multi-source Data

Open-source Large Language Models (LLMs) and their specialized variants, particularly Code LLMs, have recently delivered impressive performance. However, previous Code LLMs are typically fine-tuned on single-source data with limited quality…

Computation and Language · Computer Science 2025-02-04 Zifan Song, Yudong Wang, Wenwei Zhang, Kuikun Liu, Chengqi Lyu, Demin Song, Qipeng Guo, Hang Yan, Dahua Lin, Kai Chen, Cairong Zhao

InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct

Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on the data generated from powerful closed-source LLMs, which are expensive to obtain. This paper explores whether it is possible to use a…

Computation and Language · Computer Science 2024-12-17 Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen

OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs

Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly…

Software Engineering · Computer Science 2025-08-11 Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, Boris Ginsburg

OpenMathInstruct-1: A 1.8 Million Math Instruction Tuning Dataset

Recent work has shown the immense potential of synthetically generated datasets for training large language models (LLMs), especially for acquiring targeted skills. Current large-scale math instruction tuning datasets such as MetaMathQA (Yu…

Computation and Language · Computer Science 2024-11-05 Shubham Toshniwal, Ivan Moshkov, Sean Narenthiran, Daria Gitman, Fei Jia, Igor Gitman

WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning

Recent work demonstrates that, after instruction tuning, Code Large Language Models (Code LLMs) can obtain impressive capabilities to address a wide range of code-related tasks. However, current instruction tuning methods for Code LLMs…

Computation and Language · Computer Science 2024-06-10 Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, Qiufeng Yin

UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale…

Computation and Language · Computer Science 2025-02-18 Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening…

Computation and Language · Computer Science 2025-06-16 Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, Yonghua Lin

HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data

Large language models (LLMs) have shown great potential for automatic code generation and form the basis for various tools such as GitHub Copilot. However, recent studies highlight that much LLM-generated code contains serious security…

Cryptography and Security · Computer Science 2024-09-11 Hossein Hajipour, Lea Schönherr, Thorsten Holz, Mario Fritz

Seed-Coder: Let the Code Model Curate Data for Itself

Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code…

Computation and Language · Computer Science 2025-06-06 ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu

OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models

Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary…

Computation and Language · Computer Science 2025-03-21 Siming Huang, Tianhao Cheng, J. K. Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, Jiaheng Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu

Meta Large Language Model Compiler: Foundation Models of Compiler Optimization

Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of software engineering and coding tasks. However, their application in the domain of code and compiler optimization remains underexplored. Training…

Programming Languages · Computer Science 2024-07-04 Chris Cummins, Volker Seeker, Dejan Grubisic, Baptiste Roziere, Jonas Gehring, Gabriel Synnaeve, Hugh Leather

A Systematic Evaluation of Large Language Models of Code

Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not…

Programming Languages · Computer Science 2022-05-05 Frank F. Xu, Uri Alon, Graham Neubig, Vincent J. Hellendoorn

MCCoder: Streamlining Motion Control with LLM-Assisted Code Generation and Rigorous Verification

Large Language Models (LLMs) have demonstrated significant potential in code generation. However, in the factory automation sector, particularly motion control, manual programming, alongside inefficient and unsafe debugging practices,…

Artificial Intelligence · Computer Science 2025-07-03 Yin Li, Liangwei Wang, Shiyuan Piao, Boo-Ho Yang, Ziyue Li, Wei Zeng, Fugee Tsung

CodecLM: Aligning Language Models with Tailored Synthetic Data

Instruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor…

Computation and Language · Computer Science 2024-04-10 Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Large Language Models (LLMs) require high quality instruction data for effective alignment, particularly in code generation tasks where expert curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for…

Computation and Language · Computer Science 2025-05-26 Somshubra Majumdar, Vahid Noroozi, Mehrzad Samadi, Sean Narenthiran, Aleksander Ficek, Wasi Uddin Ahmad, Jocelyn Huang, Jagadeesh Balam, Boris Ginsburg

IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators

Code understanding and generation have fast become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation) such as cross-lingual transfer…

Artificial Intelligence · Computer Science 2024-04-16 Indraneil Paul, Goran Glavaš, Iryna Gurevych