Related papers: CodecLM: Aligning Language Models with Tailored Sy…

Towards Alignment-Centric Paradigm: A Survey of Instruction Tuning in Large Language Models

Instruction tuning is a pivotal technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. This survey provides a comprehensive overview of the full pipeline,…

Computation and Language · Computer Science · 2025-11-20 · Xudong Han, Junjie Yang, Tianyang Wang, Ziqian Bi, Xinyuan Song, Junfeng Hao, Junhao Song

Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models

Large Language Models (LLMs) require high quality instruction data for effective alignment, particularly in code generation tasks where expert curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for…

Computation and Language · Computer Science · 2025-05-26 · Somshubra Majumdar, Vahid Noroozi, Mehrzad Samadi, Sean Narenthiran, Aleksander Ficek, Wasi Uddin Ahmad, Jocelyn Huang, Jagadeesh Balam, Boris Ginsburg

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users,…

Computation and Language · Computer Science · 2026-01-30 · Ajay Patel, Colin Raffel, Chris Callison-Burch

Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models

Instruction tuning is crucial for enabling Large Language Models (LLMs) to solve real-world tasks. Prior work has shown the effectiveness of instruction-tuning data synthesized solely from LLMs, raising a fundamental question: Do we still…

Computation and Language · Computer Science · 2025-08-15 · Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki

Synthetic Data Generation Using Large Language Models: Advances in Text and Code

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment…

Computation and Language · Computer Science · 2025-11-21 · Mihai Nadas, Laura Diosan, Andreea Tomescu

Optimizing Instruction Synthesis: Effective Exploration of Evolutionary Space with Tree Search

Instruction tuning is a crucial technique for aligning language models with humans' actual goals in the real world. Extensive research has highlighted that the quality of instruction data is essential for the success of this alignment. However,…

Artificial Intelligence · Computer Science · 2024-10-15 · Chenglin Li, Qianglong Chen, Zhi Li, Feng Tao, Yicheng Li, Hao Chen, Fei Yu, Yin Zhang

DecIF: Improving Instruction-Following through Meta-Decomposition

Instruction-following has emerged as a crucial capability for large language models (LLMs). However, existing approaches often rely on pre-existing documents or external resources to synthesize instruction-following data, which limits their…

Computation and Language · Computer Science · 2025-06-12 · Tingfeng Hui, Pengyu Zhu, Bowen Ping, Ling Tang, Guanting Dong, Yaqi Zhang, Sen Su

Semi-Instruct: Bridging Natural-Instruct and Self-Instruct for Code Large Language Models

Instruction tuning plays a pivotal role in Code Large Language Models (Code LLMs) for the task of program synthesis. Presently, two dominant paradigms for collecting tuning data are natural-instruct (human-written) and self-instruct…

Computation and Language · Computer Science · 2024-03-04 · Xianzhen Luo, Qingfu Zhu, Zhiming Zhang, Xu Wang, Qing Yang, Dongliang Xu, Wanxiang Che

SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning

Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. However, prompting often leads models to make predictions with lower accuracy compared to finetuning a model…

Computation and Language · Computer Science · 2024-08-13 · Chenyang Zhao, Xueying Jia, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

Don't Fine-Tune, Decode: Syntax Error-Free Tool Use via Constrained Decoding

Instruction-tuned large language models (LLMs) excel at many tasks but often fail to use external tools due to complicated and unfamiliar syntax constraints. While extensive fine-tuning and prompting can mitigate the issue, these approaches…

Computation and Language · Computer Science · 2024-06-05 · Kexun Zhang, Hongqiao Chen, Lei Li, William Wang

SelfCodeAlign: Self-Alignment for Code Generation

Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for…

Computation and Language · Computer Science · 2024-11-04 · Yuxiang Wei, Federico Cassano, Jiawei Liu, Yifeng Ding, Naman Jain, Zachary Mueller, Harm de Vries, Leandro von Werra, Arjun Guha, Lingming Zhang

CodeIF: Benchmarking the Instruction-Following Capabilities of Large Language Models for Code Generation

With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated…

Software Engineering · Computer Science · 2025-08-05 · Kaiwen Yan, Hongcheng Guo, Xuanqing Shi, Shaosheng Cao, Donglin Di, Zhoujun Li

WildLong: Synthesizing Realistic Long-Context Instruction Data at Scale

Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data…

Computation and Language · Computer Science · 2025-02-25 · Jiaxi Li, Xingxing Zhang, Xun Wang, Xiaolong Huang, Li Dong, Liang Wang, Si-Qing Chen, Wei Lu, Furu Wei

Code Review Without Borders: Evaluating Synthetic vs. Real Data for Review Recommendation

Automating the decision of whether a code change requires manual review is vital for maintaining software quality in modern development workflows. However, the emergence of new programming languages and frameworks creates a critical…

Software Engineering · Computer Science · 2025-09-08 · Yogev Cohen, Dudi Ohayon, Romy Somkin, Yehudit Aperstein, Alexander Apartsin

UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale…

Computation and Language · Computer Science · 2025-02-18 · Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

Mastering the Craft of Data Synthesis for CodeLLMs

Large language models (LLMs) have shown impressive performance in code understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data…

Software Engineering · Computer Science · 2025-02-10 · Meng Chen, Philip Arthur, Qianyu Feng, Cong Duy Vu Hoang, Yu-Heng Hong, Mahdi Kazemi Moghaddam, Omid Nezami, Thien Nguyen, Gioacchino Tangari, Duy Vu, Thanh Vu, Mark Johnson, Krishnaram Kenthapadi, Don Dharmasiri, Long Duong, Yuan-Fang Li

InstructCoder: Instruction Tuning Large Language Models for Code Editing

Code editing encompasses a variety of pragmatic tasks that developers deal with daily. Despite its relevance and practical usefulness, automatic code editing remains an underexplored area in the evolution of deep learning models, partly due…

Computation and Language · Computer Science · 2024-02-29 · Kaixin Li, Qisheng Hu, Xu Zhao, Hui Chen, Yuxi Xie, Tiedong Liu, Qizhe Xie, Junxian He

Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning

Recent work targeting large language models (LLMs) for code generation demonstrated that increasing the amount of training data through synthetic code generation often leads to exceptional performance. In this paper we explore data pruning…

Software Engineering · Computer Science · 2024-07-09 · Yun-Da Tsai, Mingjie Liu, Haoxing Ren

Aligners: Decoupling LLMs and Alignment

Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to…

Computation and Language · Computer Science · 2024-10-07 · Lilian Ngweta, Mayank Agarwal, Subha Maity, Alex Gittens, Yuekai Sun, Mikhail Yurochkin

Seed-Coder: Let the Code Model Curate Data for Itself

Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code…

Computation and Language · Computer Science · 2025-06-06 · ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu