Related papers: InstructCoder: Instruction Tuning Large Language M…

OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs

Large Language Models (LLMs) have transformed software development by enabling code generation, automated debugging, and complex reasoning. However, their continued advancement is constrained by the scarcity of high-quality, publicly…

Software Engineering · Computer Science · 2025-08-11 · Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, Boris Ginsburg

InstructEdit: Instruction-based Knowledge Editing for Large Language Models

Knowledge editing for large language models can offer an efficient solution to alter a model's behavior without negatively impacting the overall performance. However, the current approaches encounter issues with limited generalizability…

Computation and Language · Computer Science · 2024-04-30 · Ningyu Zhang, Bozhong Tian, Siyuan Cheng, Xiaozhuan Liang, Yi Hu, Kouying Xue, Yanjie Gou, Xi Chen, Huajun Chen

Envisioning Future Interactive Web Development: Editing Webpage with Natural Language

The evolution of web applications relies on iterative code modifications, a process that is traditionally manual and time-consuming. While Large Language Models (LLMs) can generate UI code, their ability to edit existing code from new…

Software Engineering · Computer Science · 2025-10-31 · Truong Hai Dang, Jingyu Xiao, Yintong Huo

Can It Edit? Evaluating the Ability of Large Language Models to Follow Code Editing Instructions

A significant amount of research is focused on developing and evaluating large language models for a variety of code synthesis tasks. These include synthesizing code from natural language, synthesizing tests from code, and synthesizing…

Software Engineering · Computer Science · 2024-09-25 · Federico Cassano, Luisa Li, Akul Sethi, Noah Shinn, Abby Brennan-Jones, Jacob Ginesin, Edward Berman, George Chakhnashvili, Anton Lozhkov, Carolyn Jane Anderson, Arjun Guha

WaveCoder: Widespread And Versatile Enhancement For Code Large Language Models By Instruction Tuning

Recent work demonstrates that, after instruction tuning, Code Large Language Models (Code LLMs) can obtain impressive capabilities to address a wide range of code-related tasks. However, current instruction tuning methods for Code LLMs…

Computation and Language · Computer Science · 2024-06-10 · Zhaojian Yu, Xin Zhang, Ning Shang, Yangyu Huang, Can Xu, Yishujie Zhao, Wenxiang Hu, Qiufeng Yin

UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale…

Computation and Language · Computer Science · 2025-02-18 · Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

InverseCoder: Self-improving Instruction-Tuned Code LLMs with Inverse-Instruct

Recent advancements in open-source code large language models (LLMs) have been driven by fine-tuning on the data generated from powerful closed-source LLMs, which are expensive to obtain. This paper explores whether it is possible to use a…

Computation and Language · Computer Science · 2024-12-17 · Yutong Wu, Di Huang, Wenxuan Shi, Wei Wang, Lingzhe Gao, Shihao Liu, Ziyuan Nan, Kaizhao Yuan, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Yewen Pu, Dawei Yin, Xing Hu, Yunji Chen

Seed-Coder: Let the Code Model Curate Data for Itself

Code data in large language model (LLM) pretraining is recognized crucial not only for code-related tasks but also for enhancing general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code…

Computation and Language · Computer Science · 2025-06-06 · ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu

Building Instruction-Tuning Datasets from Human-Written Instructions with Open-Weight Large Language Models

Instruction tuning is crucial for enabling Large Language Models (LLMs) to solve real-world tasks. Prior work has shown the effectiveness of instruction-tuning data synthesized solely from LLMs, raising a fundamental question: Do we still…

Computation and Language · Computer Science · 2025-08-15 · Youmi Ma, Sakae Mizuki, Kazuki Fujii, Taishi Nakamura, Masanari Ohi, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Koki Maeda, Kakeru Hattori, Takumi Okamoto, Shigeki Ishida, Rio Yokota, Hiroya Takamura, Naoaki Okazaki

OctoPack: Instruction Tuning Code Large Language Models

Finetuning large language models (LLMs) on instructions leads to vast performance improvements on natural language tasks. We apply instruction tuning using code, leveraging the natural structure of Git commits, which pair code changes with…

Computation and Language · Computer Science · 2024-02-20 · Niklas Muennighoff, Qian Liu, Armel Zebaze, Qinkai Zheng, Binyuan Hui, Terry Yue Zhuo, Swayam Singh, Xiangru Tang, Leandro von Werra, Shayne Longpre

DolphCoder: Echo-Locating Code Large Language Models with Diverse and Multi-Objective Instruction Tuning

Code Large Language Models (Code LLMs) have demonstrated outstanding performance in code-related tasks. Several instruction tuning approaches have been proposed to boost the code generation performance of pre-trained Code LLMs. In this…

Computation and Language · Computer Science · 2024-02-15 · Yejie Wang, Keqing He, Guanting Dong, Pei Wang, Weihao Zeng, Muxi Diao, Yutao Mou, Mengdi Zhang, Jingang Wang, Xunliang Cai, Weiran Xu

Instruction Mining: Instruction Data Selection for Tuning Large Language Models

Large language models (LLMs) are initially pretrained for broad capabilities and then finetuned with instruction-following datasets to improve their performance in interacting with humans. Despite advances in finetuning, a standardized…

Computation and Language · Computer Science · 2024-07-30 · Yihan Cao, Yanbin Kang, Chi Wang, Lichao Sun

CodecLM: Aligning Language Models with Tailored Synthetic Data

Instruction tuning has emerged as the key in aligning large language models (LLMs) with specific task instructions, thereby mitigating the discrepancy between the next-token prediction objective and users' actual goals. To reduce the labor…

Computation and Language · Computer Science · 2024-04-10 · Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister

Instruction Tuning for Secure Code Generation

Modern language models (LMs) have gained widespread acceptance in everyday and professional contexts, particularly in programming. An essential procedure enabling this adoption is instruction tuning, which substantially enhances LMs'…

Cryptography and Security · Computer Science · 2024-07-15 · Jingxuan He, Mark Vero, Gabriela Krasnopolska, Martin Vechev

Understanding Robustness of Model Editing in Code LLMs

Large language models (LLMs) for code are increasingly used in software development, but they remain static after pretraining while APIs and software libraries continue to evolve. Model editing offers a lightweight alternative to retraining…

Software Engineering · Computer Science · 2026-05-11 · Vinaik Chhetri, Moghis Fereidouni, A. B. Siddique, Umar Farooq

Refactoring with LLMs: Bridging Human Expertise and Machine Understanding

Code refactoring is a fundamental software engineering practice aimed at improving code quality and maintainability. Despite its importance, developers often neglect refactoring due to the significant time, effort, and resources it…

Software Engineering · Computer Science · 2025-10-07 · Yonnel Chen Kuang Piao, Jean Carlors Paul, Leuson Da Silva, Arghavan Moradi Dakhel, Mohammad Hamdaqa, Foutse Khomh

ITDR: An Instruction Tuning Dataset for Enhancing Large Language Models in Recommendations

Large language models (LLMs) have demonstrated outstanding performance in natural language processing tasks. However, in the field of recommender systems, due to the inherent structural discrepancy between user behavior data and natural…

Information Retrieval · Computer Science · 2026-01-01 · Zekun Liu, Xiaowen Huang, Jitao Sang

How Do Your Code LLMs Perform? Empowering Code Instruction Tuning with High-Quality Data

Recently, there has been a growing interest in studying how to construct better code instruction tuning data. However, we observe Code models trained with these datasets exhibit high performance on HumanEval but perform worse on other…

Software Engineering · Computer Science · 2024-09-09 · Yejie Wang, Keqing He, Dayuan Fu, Zhuoma Gongque, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users,…

Computation and Language · Computer Science · 2026-01-30 · Ajay Patel, Colin Raffel, Chris Callison-Burch

EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits

Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability and…

Software Engineering · Computer Science · 2025-11-18 · Wayne Chi, Valerie Chen, Ryan Shar, Aditya Mittal, Jenny Liang, Wei-Lin Chiang, Anastasios Nikolas Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue