Related papers: JEMMA: An Extensible Java Dataset for ML4Code Appl…

JEMA: A Joint Embedding Framework for Scalable Co-Learning with Multimodal Alignment

This work introduces JEMA (Joint Embedding with Multimodal Alignment), a novel co-learning framework tailored for laser metal deposition (LMD), a pivotal process in metal additive manufacturing. As Industry 5.0 gains traction in industrial…

Computer Vision and Pattern Recognition · Computer Science · 2024-11-01 · Joao Sousa, Roya Darabi, Armando Sousa, Frank Brueckner, Luís Paulo Reis, Ana Reis

Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Program code as a data source is gaining popularity in the data science community. Possible applications for models trained on such assets range from classification for data dimensionality reduction to automatic code generation. However,…

Software Engineering · Computer Science · 2022-10-31 · Anastasia Drozdova, Polina Guseva, Ekaterina Trofimova, Anna Scherbakova, Andrey Ustyuzhanin

Assessing Project-Level Fine-Tuning of ML4SE Models

Machine Learning for Software Engineering (ML4SE) is an actively growing research area that focuses on methods that help programmers in their work. In order to apply the developed methods in practice, they need to achieve reasonable quality…

Software Engineering · Computer Science · 2022-06-08 · Egor Bogomolov, Sergey Zhuravlev, Egor Spirin, Timofey Bryksin

ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code

In recent years, the application of large language models (LLMs) to code-related tasks has gained significant attention. However, existing evaluation benchmarks often focus on limited scenarios, such as code generation or completion, which…

Software Engineering · Computer Science · 2024-09-17 · Jia Feng, Jiachen Liu, Cuiyun Gao, Chun Yong Chong, Chaozheng Wang, Shan Gao, Xin Xia

COMEX: A Tool for Generating Customized Source Code Representations

Learning effective representations of source code is critical for any Machine Learning for Software Engineering (ML4SE) system. Inspired by natural language processing, large language models (LLMs) like Codex and CodeGen treat code as…

Software Engineering · Computer Science · 2023-07-11 · Debeshee Das, Noble Saji Mathews, Alex Mathai, Srikanth Tamilselvam, Kranthi Sedamaki, Sridhar Chimalakonda, Atul Kumar

Increasing LLM Coding Capabilities through Diverse Synthetic Coding Tasks

Large language models (LLMs) have shown impressive promise in code generation, yet their progress remains limited by the shortage of large-scale datasets that are both diverse and well-aligned with human reasoning. Most existing resources…

Machine Learning · Computer Science · 2025-10-28 · Amal Abed, Ivan Lukic, Jörg K. H. Franke, Frank Hutter

ML4Chem: A Machine Learning Package for Chemistry and Materials Science

ML4Chem is an open-source machine learning library for chemistry and materials science. It provides an extendable platform to develop and deploy machine learning models and pipelines and is targeted to the non-expert and expert users.…

Chemical Physics · Physics · 2020-03-31 · Muammar El Khatib, Wibe A de Jong

CodeComplex: Dataset for Worst-Case Time Complexity Prediction

Reasoning ability of Large Language Models (LLMs) is a crucial ability, especially in complex decision-making tasks. One significant task to show LLMs' reasoning capability is code time complexity prediction, which involves various…

Software Engineering · Computer Science · 2024-12-25 · Seung-Yeop Baik, Joonghyuk Hahn, Jungin Kim, Mingi Jeon, Aditi, Yo-Sub Han, Sang-Ki Ko

Megadiff: A Dataset of 600k Java Source Code Changes Categorized by Diff Size

This paper presents Megadiff, a dataset of source code diffs. It focuses on Java, with strict inclusion criteria based on commit message and diff size. Megadiff contains 663 029 Java diffs that can be used for research on commit…

Software Engineering · Computer Science · 2021-08-11 · Martin Monperrus, Matias Martinez, He Ye, Fernanda Madeiral, Thomas Durieux, Zhongxing Yu

DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains…

Cryptography and Security · Computer Science · 2023-08-10 · Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, David Wagner

Large Language Models (LLMs) for Source Code Analysis: applications, models and datasets

Large language models (LLMs) and transformer-based architectures are increasingly utilized for source code analysis. As software systems grow in complexity, integrating LLMs into code analysis workflows becomes essential for enhancing…

Software Engineering · Computer Science · 2025-03-25 · Hamed Jelodar, Mohammad Meymani, Roozbeh Razavi-Far

On the use of LLMs to generate a dataset of Neural Networks

Neural networks are increasingly used to support decision-making. To verify their reliability and adaptability, researchers and practitioners have proposed a variety of tools and methods for tasks such as NN code verification, refactoring,…

Machine Learning · Computer Science · 2026-02-05 · Nadia Daoudi, Jordi Cabot

MigrationBench: Repository-Level Code Migration Benchmark from Java 8

With the rapid advancement of powerful large language models (LLMs) in recent years, a wide range of software engineering tasks can now be addressed using LLMs, significantly enhancing productivity and scalability. Numerous benchmark…

Software Engineering · Computer Science · 2026-05-29 · Linbo Liu, Xinle Liu, Qiang Zhou, Lin Chen, Yihan Liu, Hoan Nguyen, Behrooz Omidvar-Tehrani, Xi Shen, Jun Huan, Omer Tripp, Anoop Deoras

SnipGen: A Mining Repository Framework for Evaluating LLMs for Code

Language Models (LLMs), such as transformer-based neural networks trained on billions of parameters, have become increasingly prevalent in software engineering (SE). These models, trained on extensive datasets that include code…

Software Engineering · Computer Science · 2025-02-18 · Daniel Rodriguez-Cardenas, Alejandro Velasco, Denys Poshyvanyk

jMT: Testing Correctness of Java Memory Models (Extended Version)

Folklore is often saying "The Java memory model is broken." Therefore, several approaches have proposed repairs, only to find new programs exhibiting unexpected, unintuitive behavior or the model forbidding standard compiler optimizations.…

Programming Languages · Computer Science · 2026-04-20 · Lukas Panneke, Heike Wehrheim

Automated and Context-Aware Code Documentation Leveraging Advanced LLMs

Code documentation is essential to improve software maintainability and comprehension. The tedious nature of manual code documentation has led to much research on automated documentation generation. Existing automated approaches primarily…

Software Engineering · Computer Science · 2025-09-19 · Swapnil Sharma Sarker, Tanzina Taher Ifty

On the Evaluation of Large Language Models in Unit Test Generation

Unit testing is an essential activity in software development for verifying the correctness of software components. However, manually writing unit tests is challenging and time-consuming. The emergence of Large Language Models (LLMs) offers…

Software Engineering · Computer Science · 2024-09-26 · Lin Yang, Chen Yang, Shutao Gao, Weijing Wang, Bo Wang, Qihao Zhu, Xiao Chu, Jianyi Zhou, Guangtai Liang, Qianxiang Wang, Junjie Chen

Jenga: Effective Memory Management for Serving LLM with Heterogeneity

Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, maximizing the request batch size by managing GPU memory efficiently is crucial. While PagedAttention has recently…

Distributed, Parallel, and Cluster Computing · Computer Science · 2025-03-25 · Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, Jidong Zhai, Joseph Gonzalez, Ion Stoica

CoderUJB: An Executable and Unified Java Benchmark for Practical Programming Scenarios

In the evolving landscape of large language models (LLMs) tailored for software engineering, the need for benchmarks that accurately reflect real-world development scenarios is paramount. Current benchmarks are either too simplistic or fail…

Software Engineering · Computer Science · 2024-03-29 · Zhengran Zeng, Yidong Wang, Rui Xie, Wei Ye, Shikun Zhang

The Code Barrier: What LLMs Actually Understand?

Understanding code represents a core ability needed for automating software development tasks. While foundation models like LLMs show impressive results across many software engineering challenges, the extent of their true semantic…

Software Engineering · Computer Science · 2025-04-16 · Serge Lionel Nikiema, Jordan Samhi, Abdoul Kader Kaboré, Jacques Klein, Tegawendé F. Bissyandé