Related papers: CodeS: Towards Code Model Generalization Under Dis…

Estimating Predictive Uncertainty Under Program Data Distribution Shift

Deep learning (DL) techniques have achieved great success in predictive accuracy in a variety of tasks, but deep neural networks (DNNs) are shown to produce highly overconfident scores for even abnormal samples. Well-defined uncertainty…

Machine Learning · Computer Science 2021-07-26 Yufei Li, Simin Chen, Wei Yang

Deep Learning for Source Code Modeling and Generation: Models, Applications and Challenges

Deep Learning (DL) techniques for Natural Language Processing have been evolving remarkably fast. Recently, the DL advances in language modeling, machine translation and paragraph understanding are so prominent that the potential of DL in…

Software Engineering · Computer Science 2020-06-16 Triet H. M. Le, Hao Chen, M. Ali Babar

Exploring Distributional Shifts in Large Language Models for Code Analysis

We systematically study how three large language models with code capabilities - CodeT5, Codex, and ChatGPT - generalize to out-of-domain data. We consider two fundamental applications - code summarization, and code generation. We split…

Computation and Language · Computer Science 2023-12-07 Shushan Arakelyan, Rocktim Jyoti Das, Yi Mao, Xiang Ren

Large Language Models (LLMs) for Source Code Analysis: applications, models and datasets

Large language models (LLMs) and transformer-based architectures are increasingly utilized for source code analysis. As software systems grow in complexity, integrating LLMs into code analysis workflows becomes essential for enhancing…

Software Engineering · Computer Science 2025-03-25 Hamed Jelodar, Mohammad Meymani, Roozbeh Razavi-Far

Commit2Vec: Learning Distributed Representations of Code Changes

Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis too, due to the enormous amount of freely available…

Software Engineering · Computer Science 2021-11-18 Rocío Cabrera Lozoya, Arnaud Baumann, Antonino Sabetta, Michele Bezzi

Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study

Large language models (LLMs) have achieved state-of-the-art performance in various software engineering tasks, including error detection, clone detection, and code translation, primarily leveraging high-resource programming languages like…

Computation and Language · Computer Science 2025-06-11 Razan Baltaji, Saurabh Pujar, Louis Mandel, Martin Hirzel, Luca Buratti, Lav Varshney

Improving Zero-Shot Cross-Lingual Transfer via Progressive Code-Switching

Code-switching is a data augmentation scheme mixing words from multiple languages into source lingual text. It has achieved considerable generalization performance of cross-lingual transfer tasks by aligning cross-lingual contextual word…

Computation and Language · Computer Science 2024-06-21 Zhuoran Li, Chunming Hu, Junfan Chen, Zhijun Chen, Xiaohui Guo, Richong Zhang

Tackling Long-Tailed Category Distribution Under Domain Shifts

Machine learning models fail to perform well on real-world applications when 1) the category distribution P(Y) of the training dataset suffers from long-tailed distribution and 2) the test data is drawn from different conditional…

Computer Vision and Pattern Recognition · Computer Science 2022-07-22 Xiao Gu, Yao Guo, Zeju Li, Jianing Qiu, Qi Dou, Yuxuan Liu, Benny Lo, Guang-Zhong Yang

Uncertainty Awareness of Large Language Models Under Code Distribution Shifts: A Benchmark Study

Large Language Models (LLMs) have been widely employed in programming language analysis to enhance human productivity. Yet, their reliability can be compromised by various code distribution shifts, leading to inconsistent outputs. While…

Software Engineering · Computer Science 2024-02-12 Yufei Li, Simin Chen, Yanghong Guo, Wei Yang, Yue Dong, Cong Liu

CONCORD: Clone-aware Contrastive Learning for Source Code

Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE…

Software Engineering · Computer Science 2023-06-07 Yangruibo Ding, Saikat Chakraborty, Luca Buratti, Saurabh Pujar, Alessandro Morari, Gail Kaiser, Baishakhi Ray

Maybe Deep Neural Networks are the Best Choice for Modeling Source Code

Statistical language modeling techniques have successfully been applied to source code, yielding a variety of new software development tools, such as tools for code suggestion and improving readability. A major issue with these techniques…

Software Engineering · Computer Science 2019-03-15 Rafael-Michael Karampatsis, Charles Sutton

CodeLL: A Lifelong Learning Dataset to Support the Co-Evolution of Data and Language Models of Code

Motivated by recent work on lifelong learning applications for language models (LMs) of code, we introduce CodeLL, a lifelong learning dataset focused on code changes. Our contribution addresses a notable research gap marked by the absence…

Software Engineering · Computer Science 2023-12-21 Martin Weyssow, Claudio Di Sipio, Davide Di Ruscio, Houari Sahraoui

A Survey of Deep Graph Learning under Distribution Shifts: from Graph Out-of-Distribution Generalization to Adaptation

Distribution shifts on graphs -- the discrepancies in data distribution between training and employing a graph machine learning model -- are ubiquitous and often unavoidable in real-world scenarios. These shifts may severely deteriorate…

Machine Learning · Computer Science 2025-03-31 Kexin Zhang, Shuhan Liu, Song Wang, Weili Shi, Chen Chen, Pan Li, Sheng Li, Jundong Li, Kaize Ding

Text Classification Under Class Distribution Shift: A Survey

The basic underlying assumption of machine learning (ML) models is that the training and test data are sampled from the same distribution. However, in daily practice, this assumption is often broken, i.e. the distribution of the test data…

Computation and Language · Computer Science 2026-01-16 Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu

Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text

Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents…

Computation and Language · Computer Science 2024-05-08 Frances A. Laureano De Leon, Harish Tayyar Madabushi, Mark Lee

GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts

Geometric deep learning (GDL) has gained significant attention in scientific fields, for its proficiency in modeling data with intricate geometric structures. However, very few works have delved into its capability of tackling the…

Machine Learning · Computer Science 2024-11-21 Deyu Zou, Shikun Liu, Siqi Miao, Victor Fung, Shiyu Chang, Pan Li

Out of style: Misadventures with LLMs and code style transfer

Like text, programs have styles, and certain programming styles are more desirable than others for program readability, maintainability, and performance. Code style transfer, however, is difficult to automate except for trivial style…

Software Engineering · Computer Science 2024-06-18 Karl Munson, Chih-Kai Ting, Serenity Wade, Anish Savla, Julian Dolby, Kiran Kate, Kavitha Srinivas

CodeS: Towards Building Open-source Language Models for Text-to-SQL

Language models have shown promising performance on the task of translating natural language questions into SQL queries (Text-to-SQL). However, most of the state-of-the-art (SOTA) approaches rely on powerful yet closed-source large language…

Computation and Language · Computer Science 2024-02-27 Haoyang Li, Jing Zhang, Hanbing Liu, Ju Fan, Xiaokang Zhang, Jun Zhu, Renjie Wei, Hongyan Pan, Cuiping Li, Hong Chen

Opportunities and Challenges in Code Search Tools

Code search is a core software engineering task. Effective code search tools can help developers substantially improve their software development efficiency and effectiveness. In recent years, many code search studies have leveraged…

Software Engineering · Computer Science 2021-10-12 Chao Liu, Xin Xia, David Lo, Cuiyun Gao, Xiaohu Yang, John Grundy

Towards More Trustworthy Deep Code Models by Enabling Out-of-Distribution Detection

Numerous machine learning (ML) models have been developed, including those for software engineering (SE) tasks, under the assumption that training and testing data come from the same distribution. However, training and testing distributions…

Software Engineering · Computer Science 2025-03-04 Yanfu Yan, Viet Duong, Huajie Shao, Denys Poshyvanyk