English
Related papers

Related papers: CodecLM: Aligning Language Models with Tailored Sy…

200 papers

Instruction tuning is a pivotal technique for aligning large language models (LLMs) with human intentions, safety constraints, and domain-specific requirements. This survey provides a comprehensive overview of the full pipeline,…

Computation and Language · Computer Science 2025-11-20 Xudong Han , Junjie Yang , Tianyang Wang , Ziqian Bi , Xinyuan Song , Junfeng Hao , Junhao Song

Large Language Models (LLMs) require high quality instruction data for effective alignment, particularly in code generation tasks where expert curated datasets are expensive to produce. We present Genetic-Instruct, a scalable algorithm for…

Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users,…

Computation and Language · Computer Science 2026-01-30 Ajay Patel , Colin Raffel , Chris Callison-Burch

Instruction tuning is crucial for enabling Large Language Models (LLMs) to solve real-world tasks. Prior work has shown the effectiveness of instruction-tuning data synthesized solely from LLMs, raising a fundamental question: Do we still…

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment…

Computation and Language · Computer Science 2025-11-21 Mihai Nadas , Laura Diosan , Andreea Tomescu

Instruction tuning is a crucial technique for aligning language models with humans' actual goals in the real world. Extensive research has highlighted the quality of instruction data is essential for the success of this alignment. However,…

Artificial Intelligence · Computer Science 2024-10-15 Chenglin Li , Qianglong Chen , Zhi Li , Feng Tao , Yicheng Li , Hao Chen , Fei Yu , Yin Zhang

Instruction-following has emerged as a crucial capability for large language models (LLMs). However, existing approaches often rely on pre-existing documents or external resources to synthesize instruction-following data, which limits their…

Computation and Language · Computer Science 2025-06-12 Tingfeng Hui , Pengyu Zhu , Bowen Ping , Ling Tang , Guanting Dong , Yaqi Zhang , Sen Su

Instruction tuning plays a pivotal role in Code Large Language Models (Code LLMs) for the task of program synthesis. Presently, two dominant paradigms for collecting tuning data are natural-instruct (human-written) and self-instruct…

Computation and Language · Computer Science 2024-03-04 Xianzhen Luo , Qingfu Zhu , Zhiming Zhang , Xu Wang , Qing Yang , Dongliang Xu , Wanxiang Che

Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. However, prompting often leads models to make predictions with lower accuracy compared to finetuning a model…

Computation and Language · Computer Science 2024-08-13 Chenyang Zhao , Xueying Jia , Vijay Viswanathan , Tongshuang Wu , Graham Neubig

Instruction-tuned large language models (LLMs) excel at many tasks but often fail to use external tools due to complicated and unfamiliar syntax constraints. While extensive fine-tuning and prompting can mitigate the issue, these approaches…

Computation and Language · Computer Science 2024-06-05 Kexun Zhang , Hongqiao Chen , Lei Li , William Wang

Instruction tuning is a supervised fine-tuning approach that significantly improves the ability of large language models (LLMs) to follow human instructions. We propose SelfCodeAlign, the first fully transparent and permissive pipeline for…

Computation and Language · Computer Science 2024-11-04 Yuxiang Wei , Federico Cassano , Jiawei Liu , Yifeng Ding , Naman Jain , Zachary Mueller , Harm de Vries , Leandro von Werra , Arjun Guha , Lingming Zhang

With the rapid advancement of Large Language Models (LLMs), the demand for robust instruction-following capabilities in code generation tasks has grown significantly. Code generation not only facilitates faster prototyping and automated…

Software Engineering · Computer Science 2025-08-05 Kaiwen Yan , Hongcheng Guo , Xuanqing Shi , Shaosheng Cao , Donglin Di , Zhoujun Li

Large language models (LLMs) with extended context windows enable tasks requiring extensive information integration but are limited by the scarcity of high-quality, diverse datasets for long-context instruction tuning. Existing data…

Computation and Language · Computer Science 2025-02-25 Jiaxi Li , Xingxing Zhang , Xun Wang , Xiaolong Huang , Li Dong , Liang Wang , Si-Qing Chen , Wei Lu , Furu Wei

Automating the decision of whether a code change requires manual review is vital for maintaining software quality in modern development workflows. However, the emergence of new programming languages and frameworks creates a critical…

Software Engineering · Computer Science 2025-09-08 Yogev Cohen , Dudi Ohayon , Romy Somkin , Yehudit Aperstein , Alexander Apartsin

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale…

Computation and Language · Computer Science 2025-02-18 Yichuan Ma , Yunfan Shao , Peiji Li , Demin Song , Qipeng Guo , Linyang Li , Xipeng Qiu , Kai Chen

Large language models (LLMs) have shown impressive performance in \emph{code} understanding and generation, making coding tasks a key focus for researchers due to their practical applications and value as a testbed for LLM evaluation. Data…

Code editing encompasses a variety of pragmatic tasks that developers deal with daily. Despite its relevance and practical usefulness, automatic code editing remains an underexplored area in the evolution of deep learning models, partly due…

Computation and Language · Computer Science 2024-02-29 Kaixin Li , Qisheng Hu , Xu Zhao , Hui Chen , Yuxi Xie , Tiedong Liu , Qizhe Xie , Junxian He

Recent work targeting large language models (LLMs) for code generation demonstrated that increasing the amount of training data through synthetic code generation often leads to exceptional performance. In this paper we explore data pruning…

Software Engineering · Computer Science 2024-07-09 Yun-Da Tsai , Mingjie Liu , Haoxing Ren

Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to…

Computation and Language · Computer Science 2024-10-07 Lilian Ngweta , Mayank Agarwal , Subha Maity , Alex Gittens , Yuekai Sun , Mikhail Yurochkin

Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code…

‹ Prev 1 2 3 10 Next ›