Related papers: SteloCoder: a Decoder-Only LLM for Multi-Language …

ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation

Code translation is a crucial activity in the software development and maintenance process, and researchers have recently begun to focus on using pre-trained large language models (LLMs) for code translation. However, existing LLMs only…

Software Engineering · Computer Science 2025-09-30 Minghua He, Yue Chen, Fangkai Yang, Pu Zhao, Wenjie Yin, Yu Kang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang

MonoCoder: Domain-Specific Code Language Model for HPC Codes and Tasks

With easier access to powerful compute resources, there is a growing trend in AI for software development to develop large language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the…

Programming Languages · Computer Science 2024-09-23 Tal Kadosh, Niranjan Hasabnis, Vy A. Vo, Nadav Schneider, Neva Krien, Mihai Capota, Abdul Wasay, Nesreen Ahmed, Ted Willke, Guy Tamir, Yuval Pinter, Timothy Mattson, Gal Oren

StarCoder: may the source be with you!

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling…

Computation and Language · Computer Science 2023-12-14 Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models

Pre-trained large language models (LLMs) have significantly improved code generation. As these models scale up, there is an increasing need for the output to handle more intricate tasks and to be appropriately specialized to particular…

Machine Learning · Computer Science 2024-05-22 Xiangru Tang, Bill Qian, Rick Gao, Jiakang Chen, Xinyun Chen, Mark Gerstein

An Empirical Study on the Code Refactoring Capability of Large Language Models

Large Language Models (LLMs) have shown potential to enhance software development through automated code generation and refactoring, reducing development time and improving code quality. This study empirically evaluates StarCoder2, an LLM…

Software Engineering · Computer Science 2024-11-05 Jonathan Cordeiro, Shayan Noei, Ying Zou

WizardCoder: Empowering Code Large Language Models with Evol-Instruct

Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In…

Computation and Language · Computer Science 2025-05-28 Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, Daxin Jiang

TreeCoder: Systematic Exploration and Optimisation of Decoding and Constraints for LLM Code Generation

Large language models (LLMs) have shown remarkable ability to generate code, yet their outputs often violate syntactic or semantic constraints when guided only through natural language prompts. We introduce TreeCoder, the most general and…

Machine Learning · Computer Science 2026-04-27 Henrijs Princis, Arindam Sharma, Cristina David

Crystal: Illuminating LLM Abilities on Language and Code

Large Language Models (LLMs) specializing in code generation (which are also often referred to as code LLMs), e.g., StarCoder and Code Llama, play increasingly critical roles in various software development scenarios. It is also crucial for…

Software Engineering · Computer Science 2024-11-08 Tianhua Tao, Junbo Li, Bowen Tan, Hongyi Wang, William Marshall, Bhargav M Kanakiya, Joel Hestness, Natalia Vassilieva, Zhiqiang Shen, Eric P. Xing, Zhengzhong Liu

Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software…

Programming Languages · Computer Science 2024-09-24 Federico Cassano, John Gouwar, Francesca Lucchetti, Claire Schlesinger, Anders Freeman, Carolyn Jane Anderson, Molly Q Feldman, Michael Greenberg, Abhinav Jangda, Arjun Guha

StarCoder 2 and The Stack v2: The Next Generation

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of…

Software Engineering · Computer Science 2024-03-01 Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, Ao Tang, Dmytro Pykhtar, Jiawei Liu, Yuxiang Wei, Tianyang Liu, Max Tian, Denis Kocetkov, Arthur Zucker, Younes Belkada, Zijian Wang, Qian Liu, Dmitry Abulkhanov, Indraneil Paul, Zhuang Li, Wen-Ding Li, Megan Risdal, Jia Li, Jian Zhu, Terry Yue Zhuo, Evgenii Zheltonozhskii, Nii Osae Osae Dade, Wenhao Yu, Lucas Krauß, Naman Jain, Yixuan Su, Xuanli He, Manan Dey, Edoardo Abati, Yekun Chai, Niklas Muennighoff, Xiangru Tang, Muhtasham Oblokulov, Christopher Akiki, Marc Marone, Chenghao Mou, Mayank Mishra, Alex Gu, Binyuan Hui, Tri Dao, Armel Zebaze, Olivier Dehaene, Nicolas Patry, Canwen Xu, Julian McAuley, Han Hu, Torsten Scholak, Sebastien Paquet, Jennifer Robinson, Carolyn Jane Anderson, Nicolas Chapados, Mostofa Patwary, Nima Tajbakhsh, Yacine Jernite, Carlos Muñoz Ferrandis, Lingming Zhang, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, Harm de Vries

Investigating Decoder-only Large Language Models for Speech-to-text Translation

Large language models (LLMs), known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains, present a promising avenue for enhancing speech-related tasks. In this paper, we focus on integrating…

Computation and Language · Computer Science 2024-07-04 Chao-Wei Huang, Hui Lu, Hongyu Gong, Hirofumi Inaguma, Ilia Kulikov, Ruslan Mavlyutov, Sravya Popuri

UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance

Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. Current approaches for obtaining high-quality code data primarily focus on (i) collecting large-scale…

Computation and Language · Computer Science 2025-02-18 Yichuan Ma, Yunfan Shao, Peiji Li, Demin Song, Qipeng Guo, Linyang Li, Xipeng Qiu, Kai Chen

SemCoder: Training Code Language Models with Comprehensive Semantics Reasoning

Code Large Language Models (Code LLMs) have excelled at tasks like code completion but often miss deeper semantics such as execution effects and dynamic states. This paper aims to bridge the gap between Code LLMs' reliance on static text…

Computation and Language · Computer Science 2024-11-04 Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, Baishakhi Ray

ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models

Tool learning has emerged as a crucial capability for large language models (LLMs) to solve complex real-world tasks through interaction with external tools. Existing approaches face significant challenges, including reliance on…

Computation and Language · Computer Science 2025-06-02 Hanxing Ding, Shuchang Tao, Liang Pang, Zihao Wei, Jinyang Gao, Bolin Ding, Huawei Shen, Xueqi Cheng

Unraveling the Potential of Large Language Models in Code Translation: How Far Are We?

While large language models (LLMs) exhibit state-of-the-art performance in various tasks, recent studies have revealed their struggle for code translation. This is because they haven't been extensively pre-trained with parallel multilingual…

Software Engineering · Computer Science 2024-10-15 Qingxiao Tao, Tingrui Yu, Xiaodong Gu, Beijun Shen

ShortCoder: Knowledge-Augmented Syntax Optimization for Token-Efficient Code Generation

Code generation tasks aim to automate the conversion of user requirements into executable code, significantly reducing manual development efforts and enhancing software productivity. The emergence of large language models (LLMs) has…

Software Engineering · Computer Science 2026-01-15 Sicong Liu, Yanxian Huang, Mingwei Liu, Jiachi Chen, Ensheng Shi, Yuchi Ma, Hongyu Zhang, Yin Zhang, Yanlin Wang

KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

In this paper, we propose KnowCoder, a Large Language Model (LLM) to conduct Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a kind of unified schema representation that LLMs can easily understand and…

Machine Learning · Computer Science 2024-03-15 Zixuan Li, Yutao Zeng, Yuxin Zuo, Weicheng Ren, Wenxuan Liu, Miao Su, Yucan Guo, Yantao Liu, Xiang Li, Zhilei Hu, Long Bai, Wei Li, Yidan Liu, Pan Yang, Xiaolong Jin, Jiafeng Guo, Xueqi Cheng

MFTCoder: Boosting Code LLMs with Multitask Fine-Tuning

Code LLMs have emerged as a specialized research field, with remarkable studies dedicated to enhancing models' coding capabilities through fine-tuning on pre-trained models. Previous fine-tuning approaches were typically tailored to…

Machine Learning · Computer Science 2023-11-07 Bingchang Liu, Chaoyu Chen, Cong Liao, Zi Gong, Huan Wang, Zhichao Lei, Ming Liang, Dajun Chen, Min Shen, Hailian Zhou, Hang Yu, Jianguo Li

MultiCoder: Multi-Programming-Lingual Pre-Training for Low-Resource Code Completion

Code completion is a valuable topic in both academia and industry. Recently, large-scale mono-programming-lingual (MonoPL) pre-training models have been proposed to boost the performance of code completion. However, the code completion on…

Computation and Language · Computer Science 2022-12-20 Zi Gong, Yinpeng Guo, Pingyi Zhou, Cuiyun Gao, Yasheng Wang, Zenglin Xu

Finetuning Large Language Models for Vulnerability Detection

This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for…

Cryptography and Security · Computer Science 2024-07-30 Alexey Shestov, Rodion Levichev, Ravil Mussabayev, Evgeny Maslov, Anton Cheshkov, Pavel Zadorozhny