Related papers: Contextualized Data-Wrangling Code Generation in C…

Integrating Code Metrics into Automated Documentation Generation for Computational Notebooks

Effective code documentation is essential for collaboration, comprehension, and long-term software maintainability, yet developers often neglect it due to its repetitive nature. Automated documentation generation has evolved from heuristic…

Software Engineering · Computer Science · 2026-02-10 · Mojtaba Mostafavi Ghahfarokhi, Hamed Jahantigh, Alireza Asadi, Abbas Heydarnoori

Data Wrangling Task Automation Using Code-Generating Language Models

Ensuring data quality in large tabular datasets is a critical challenge, typically addressed through data wrangling tasks. Traditional statistical methods, though efficient, often cannot understand the semantic context, and deep learning…

Machine Learning · Computer Science · 2025-02-25 · Ashlesha Akella, Krishnasuri Narayanam

HAConvGNN: Hierarchical Attention Based Convolutional Graph Neural Network for Code Documentation Generation in Jupyter Notebooks

Jupyter notebook allows data scientists to write machine learning code together with its documentation in cells. In this paper, we propose a new task of code documentation generation (CDG) for computational notebooks. In contrast to the…

Software Engineering · Computer Science · 2021-09-10 · Xuye Liu, Dakuo Wang, April Wang, Yufang Hou, Lingfei Wu

Development of Data Evaluation Benchmark for Data Wrangling Recommendation System

CoWrangler is a data-wrangling recommender system designed to streamline data processing tasks. Recognizing that data processing is often time-consuming and complex for novice users, we aim to simplify the decision-making process regarding…

Databases · Computer Science · 2024-09-18 · Yuqing Wang, Anna Fariha

Natural Language to Code Generation in Interactive Data Science Notebooks

Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that…

Computation and Language · Computer Science · 2022-12-20 · Pengcheng Yin, Wen-Ding Li, Kefan Xiao, Abhishek Rao, Yeming Wen, Kensen Shi, Joshua Howland, Paige Bailey, Michele Catasta, Henryk Michalewski, Alex Polozov, Charles Sutton

CoreGen: Contextualized Code Representation Learning for Commit Message Generation

Automatic generation of high-quality commit messages for code commits can substantially facilitate software developers' work and coordination. However, the semantic gap between source code and natural language poses a major challenge for…

Computation and Language · Computer Science · 2021-06-22 · Lun Yiu Nie, Cuiyun Gao, Zhicong Zhong, Wai Lam, Yang Liu, Zenglin Xu

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation.…

Software Engineering · Computer Science · 2021-03-17 · Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, Shujie Liu

Code Generation Techniques for Raw Data Processing

The motivation of the current study was to design an algorithm that can speed up the processing of a query. The important feature is generating code dynamically for a specific query. We present the technique of code generation that is…

Databases · Computer Science · 2017-12-12 · Xin Zhang

Contextualized Code Pretraining for Code Generation

As code generation becomes increasingly central to improving software development efficiency, modern code models are largely trained and evaluated on code with natural-language descriptions. In real projects, developers often implement…

Software Engineering · Computer Science · 2026-05-19 · Chen Liu, Qingyuan Liang, Hanwen Zhang, Zeyu Sun, Yakun Zhang, Lu Zhang

Code Code Evolution: Understanding How People Change Data Science Notebooks Over Time

Sensemaking is the iterative process of identifying, extracting, and explaining insights from data, where each iteration is referred to as the "sensemaking loop." Although recent work observes snapshots of the sensemaking loop within…

Human-Computer Interaction · Computer Science · 2022-09-09 · Deepthi Raghunandan, Aayushi Roy, Shenzhi Shi, Niklas Elmqvist, Leilani Battle

Completion by Comprehension: Guiding Code Generation with Multi-Granularity Understanding

As code completion tasks expand from function level to repository level, leveraging contextual information from large-scale codebases becomes a core challenge. However, existing retrieval-augmented generation (RAG) methods typically treat code as…

Software Engineering · Computer Science · 2025-12-05 · Xinkui Zhao, Rongkai Liu, Yifan Zhang, Chen Zhi, Lufei Zhang, Guanjie Cheng, Yueshen Xu, Shuiguang Deng, Jianwei Yin

CoCoMIC: Code Completion By Jointly Modeling In-file and Cross-file Context

While pre-trained language models (LM) for code have achieved great success in code completion, they generate code conditioned only on the contents within the file, i.e., in-file context, but ignore the rich semantics in other files within…

Computation and Language · Computer Science · 2023-05-25 · Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Murali Krishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, Bing Xiang

CoCoSum: Contextual Code Summarization with Multi-Relational Graph Neural Network

Source code summaries are short natural language descriptions of code snippets that help developers better understand and maintain source code. There has been a surge of work on automatic code summarization to reduce the burden of writing…

Software Engineering · Computer Science · 2021-07-06 · Yanlin Wang, Ensheng Shi, Lun Du, Xiaodi Yang, Yuxuan Hu, Shi Han, Hongyu Zhang, Dongmei Zhang

Learning to Reason via Program Generation, Emulation, and Search

Program synthesis with language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word…

Computation and Language · Computer Science · 2024-11-05 · Nathaniel Weir, Muhammad Khalifa, Linlu Qiu, Orion Weller, Peter Clark

JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation

Interactive programming with interleaved code snippet cells and natural language markdown is recently gaining popularity in the form of Jupyter notebooks, which accelerate prototyping and collaboration. To study code generation conditioned…

Machine Learning · Computer Science · 2019-10-10 · Rajas Agashe, Srinivasan Iyer, Luke Zettlemoyer

Iterative Refinement of Project-Level Code Context for Precise Code Generation with Compiler Feedback

Large Language Models (LLMs) have shown remarkable progress in automated code generation. Yet, LLM-generated code may contain errors in API usage, class, data structure, or missing project-specific information. As much of this…

Computation and Language · Computer Science · 2024-06-12 · Zhangqian Bi, Yao Wan, Zheng Wang, Hongyu Zhang, Batu Guan, Fangxin Lu, Zili Zhang, Yulei Sui, Hai Jin, Xuanhua Shi

CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance

One of the central tasks in software maintenance is being able to understand and develop code changes. Thus, given a natural language description of the desired new operation of a function, an agent (human or AI) might be asked to generate…

Software Engineering · Computer Science · 2025-02-05 · Kunal Pai, Premkumar Devanbu, Toufique Ahmed

COINS: Dynamically Generating COntextualized Inference Rules for Narrative Story Completion

Despite recent successes of large pre-trained language models in solving reasoning tasks, their inference capabilities remain opaque. We posit that such models can be made more interpretable by explicitly generating interim inference rules,…

Computation and Language · Computer Science · 2021-06-07 · Debjit Paul, Anette Frank

KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding

We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing…

Machine Learning · Computer Science · 2025-07-15 · Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran

Two-Stage Data-Driven Contextual Robust Optimization: An End-to-End Learning Approach for Online Energy Applications

Traditional end-to-end contextual robust optimization models are trained for specific contextual data, requiring complete retraining whenever new contextual information arrives. This limitation hampers their use in online decision-making…

Optimization and Control · Mathematics · 2025-10-20 · Carlos Gamboa, Alexandre Street, Davi Valladão, Bernardo Pagnocelli