Related papers: Code Less, Align More: Efficient LLM Fine-tuning f…

Data-efficient LLM Fine-tuning for Code Generation

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically…

Computation and Language · Computer Science 2025-04-18 Weijie Lv, Xuan Xia, Sheng-Jun Huang

On the Effectiveness of Training Data Optimization for LLM-based Code Generation: An Empirical Study

Large language models (LLMs) have achieved remarkable progress in code generation, largely driven by the availability of high-quality code datasets for effective training. To further improve data quality, numerous training data optimization…

Software Engineering · Computer Science 2026-01-01 Shiqi Kuang, Zhao Tian, Tao Xiao, Dong Wang, Junjie Chen

LLM-Assisted Code Cleaning For Training Accurate Code Generators

Natural language to code generation is an important application area of LLMs and has received wide attention from the community. The majority of relevant studies have exclusively concentrated on increasing the quantity and functional…

Machine Learning · Computer Science 2023-11-28 Naman Jain, Tianjun Zhang, Wei-Lin Chiang, Joseph E. Gonzalez, Koushik Sen, Ion Stoica

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Large volumes of text data have contributed significantly to the development of large language models (LLMs) in recent years. This data is typically acquired by scraping the internet, leading to pretraining datasets comprised of noisy web…

Computation and Language · Computer Science 2023-09-12 Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

Less is More: Towards Green Code Large Language Models via Unified Structural Pruning

The extensive application of Large Language Models (LLMs) in generative coding tasks has raised concerns due to their high computational demands and energy consumption. Unlike previous structural pruning methods designed for classification…

Software Engineering · Computer Science 2025-04-25 Guang Yang, Yu Zhou, Xiangyu Zhang, Wei Cheng, Ke Liu, Xiang Chen, Terry Yue Zhuo, Taolue Chen

Enhancing Code Generation for Low-Resource Languages: No Silver Bullet

The advent of Large Language Models (LLMs) has significantly advanced the field of automated code generation. LLMs rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource…

Software Engineering · Computer Science 2025-02-03 Alessandro Giagnorio, Alberto Martin-Lopez, Gabriele Bavota

Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection

Recent advancements in large language models (LLMs) have significantly improved code generation and program comprehension, accelerating the evolution of software engineering. Current methods primarily enhance model performance by leveraging…

Computation and Language · Computer Science 2025-07-04 Weijie Lyu, Sheng-Jun Huang, Xuan Xia

Brevity is the soul of wit: Pruning long files for code generation

Data curation is commonly considered a "secret-sauce" for LLM training, with higher quality data usually leading to better LLM performance. Given the scale of internet-scraped corpora, data pruning has become a larger and larger focus.…

Computation and Language · Computer Science 2024-07-02 Aaditya K. Singh, Yu Yang, Kushal Tirumala, Mostafa Elhoushi, Ari S. Morcos

Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of Large Language Models

The in-context learning ability of large language models (LLMs) enables them to generalize to novel downstream tasks with relatively few labeled examples. However, they require enormous computational resources to be deployed. Alternatively,…

Computation and Language · Computer Science 2024-01-09 Jean Kaddour, Qi Liu

Beware of Calibration Data for Pruning Large Language Models

As large language models (LLMs) are widely applied across various fields, model compression has become increasingly crucial for reducing costs and improving inference efficiency. Post-training pruning is a promising method that does not…

Computation and Language · Computer Science 2025-07-01 Yixin Ji, Yang Xiang, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, Min Zhang

Improving the Ability of Pre-trained Language Model by Imparting Large Language Model's Experience

Large Language Models (LLMs) and pre-trained Language Models (LMs) have achieved impressive success on many software engineering tasks (e.g., code completion and code generation). By leveraging huge existing code corpora (e.g., GitHub),…

Software Engineering · Computer Science 2025-01-16 Xin Yin, Chao Ni, Xiaodan Xu, Xinrui Li, Xiaohu Yang

Fine-tuning Large Language Models with Limited Data: A Survey and Practical Guide

Fine-tuning large language models (LLMs) with limited data poses a practical challenge in low-resource languages, specialized domains, and constrained deployment settings. While pre-trained LLMs provide strong foundations, effective…

Computation and Language · Computer Science 2025-10-29 Marton Szep, Daniel Rueckert, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer

Synthetic Data Generation Using Large Language Models: Advances in Text and Code

This survey reviews how large language models (LLMs) are transforming synthetic training data generation in both natural language and code domains. By producing artificial but task-relevant examples, these models can significantly augment…

Computation and Language · Computer Science 2025-11-21 Mihai Nadas, Laura Diosan, Andreea Tomescu

DReSS: Data-driven Regularized Structured Streamlining for Large Language Models

Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs. Recent studies have revealed that LLMs exhibit sparsity, providing the…

Machine Learning · Computer Science 2025-07-01 Mingkuan Feng, Jinyang Wu, Shuai Zhang, Pengpeng Shao, Ruihan Jin, Zhengqi Wen, Jianhua Tao, Feihu Che

Large Language Models Are Overparameterized Text Encoders

Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that…

Computation and Language · Computer Science 2024-10-21 Thennal D K, Tim Fischer, Chris Biemann

Frustratingly Easy Task-aware Pruning for Large Language Models

Pruning provides a practical solution to reduce the resources required to run large language models (LLMs) to benefit from their effective capabilities as well as control their cost for training and inference. Research on LLM pruning often…

Computation and Language · Computer Science 2025-10-28 Yuanhe Tian, Junjie Liu, Xican Yang, Haishan Ye, Yan Song

Superfiltering: Weak-to-Strong Data Filtering for Fast Instruction-Tuning

Instruction tuning is critical to improve LLMs but usually suffers from low-quality and redundant data. Data filtering for instruction tuning has proved important in improving both the efficiency and performance of the tuning process. But…

Computation and Language · Computer Science 2024-06-11 Ming Li, Yong Zhang, Shwai He, Zhitao Li, Hongyu Zhao, Jianzong Wang, Ning Cheng, Tianyi Zhou

On the Impact of Calibration Data in Post-training Quantization and Pruning

Quantization and pruning form the foundation of compression for neural networks, enabling efficient inference for large language models (LLMs). Recently, various quantization and pruning techniques have demonstrated remarkable performance…

Computation and Language · Computer Science 2024-11-06 Miles Williams, Nikolaos Aletras

Performance-Aligned LLMs for Generating Fast Code

Optimizing scientific software is a difficult task because codebases are often large and complex, and performance can depend upon several factors including the algorithm, its implementation, and hardware among others. Causes of poor…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-04-30 Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, Abhinav Bhatele

Escaping Collapse: The Strength of Weak Data for Large Language Model Training

Synthetically-generated data plays an increasingly larger role in training large language models. However, while synthetic data has been found to be useful, studies have also shown that without proper curation it can cause LLM performance…

Machine Learning · Computer Science 2025-12-02 Kareem Amin, Sara Babakniya, Alex Bie, Weiwei Kong, Umar Syed, Sergei Vassilvitskii