Related papers: ManyTypes4Py: A Benchmark Python Dataset for Machi…

Type4Py: Practical Deep Similarity Learning-Based Type Inference for Python

Dynamic languages, such as Python and Javascript, trade static typing for developer flexibility and productivity. Lack of static typing can cause run-time exceptions and is a major factor for weak IDE support. To alleviate these issues, PEP…

Machine Learning · Computer Science 2022-01-20 Amir M. Mir, Evaldas Latoskinas, Sebastian Proksch, Georgios Gousios

TypeEvalPy: A Micro-benchmarking Framework for Python Type Inference Tools

In light of the growing interest in type inference research for Python, both researchers and practitioners require a standardized process to assess the performance of various type inference techniques. This paper introduces TypeEvalPy, a…

Software Engineering · Computer Science 2024-01-03 Ashwin Prasad Shivarpatna Venkatesh, Samkutty Sabu, Jiawei Wang, Amir M. Mir, Li Li, Eric Bodden

Automated Type Annotation in Python Using Large Language Models

Type annotations in Python enhance maintainability and error detection. However, generating these annotations manually is error prone and requires extra effort. Traditional automation approaches like static analysis, machine learning, and…

Programming Languages · Computer Science 2025-08-04 Varun Bharti, Shashwat Jha, Dhruv Kumar, Pankaj Jalote

Cross-Domain Evaluation of a Deep Learning-Based Type Inference System

Optional type annotations allow for enriching dynamic programming languages with static typing features like better Integrated Development Environment (IDE) support, more precise program analysis, and early detection and prevention of…

Software Engineering · Computer Science 2023-07-31 Bernd Gruner, Tim Sonnekalb, Thomas S. Heinze, Clemens-Alexander Brust

Large Scale Generation of Labeled Type Data for Python

Recently, dynamically typed languages, such as Python, have gained unprecedented popularity. Although these languages alleviate the need for mandatory type annotations, types still play a critical role in program understanding and…

Programming Languages · Computer Science 2022-02-08 Ibrahim Abdelaziz, Julian Dolby, Kavitha Srinivas

NICHE: A Curated Dataset of Engineered Machine Learning Projects in Python

Machine learning (ML) has gained much attention and been incorporated into our daily lives. While there are numerous publicly available ML projects on open source platforms such as GitHub, there have been limited attempts in filtering those…

Software Engineering · Computer Science 2023-03-14 Ratnadira Widyasari, Zhou Yang, Ferdian Thung, Sheng Qin Sim, Fiona Wee, Camellia Lok, Jack Phan, Haodi Qi, Constance Tan, Qijin Tay, David Lo

Type-aware LLM-based Regression Test Generation for Python Programs

Automated regression test generation has been extensively explored, yet generating high-quality tests for Python programs remains particularly challenging. Because of Python's dynamic typing features, existing approaches, ranging from…

Software Engineering · Computer Science 2025-10-23 Runlin Liu, Zhe Zhang, Yunge Hu, Yuhang Lin, Xiang Gao, Hailong Sun

PYInfer: Deep Learning Semantic Type Inference for Python Variables

Python type inference is challenging in practice. Due to its dynamic properties and extensive dependencies on third-party libraries without type annotations, the performance of traditional static analysis techniques is limited. Although…

Software Engineering · Computer Science 2021-06-29 Siwei Cui, Gang Zhao, Zeyu Dai, Luochao Wang, Ruihong Huang, Jeff Huang

Typify: A Lightweight Usage-driven Static Analyzer for Precise Python Type Inference

Python's dynamic type system, while offering significant flexibility and expressiveness, poses substantial challenges for static analysis and automated tooling, particularly in unannotated or partially annotated codebases. Existing type…

Software Engineering · Computer Science 2026-04-08 Ali Aman, Muhammad Asaduzzaman, Shaowei Wang

pyMethods2Test: A Dataset of Python Tests Mapped to Focal Methods

Python is one of the fastest-growing programming languages and currently ranks as the top language in many lists, even recently overtaking JavaScript as the top language on GitHub. Given its importance in data science and machine learning,…

Software Engineering · Computer Science 2025-02-10 Idriss Abdelmadjid, Robert Dyer

DataSist: A Python-based library for easy data analysis, visualization and modeling

A large amount of data is produced every second from modern information systems such as mobile devices, the world wide web, Internet of Things, social media, etc. Analysis and mining of this massive data requires a lot of advanced tools and…

Machine Learning · Computer Science 2020-01-13 Rising Odegua, Festus Ikpotokin

An Empirical Study of Large Language Models for Type and Call Graph Analysis in Python and JavaScript

Large Language Models (LLMs) are increasingly being explored for their potential in software engineering, particularly in static analysis tasks. In this study, we investigate the potential of current LLMs to enhance call-graph analysis and…

Software Engineering · Computer Science 2025-07-17 Ashwin Prasad Shivarpatna Venkatesh, Rose Sunil, Samkutty Sabu, Amir M. Mir, Sofia Reis, Eric Bodden

Towards Automatic Translation of Machine Learning Visual Insights to Analytical Assertions

We present our vision for developing an automated tool capable of translating visual properties observed in Machine Learning (ML) visualisations into Python assertions. The tool aims to streamline the process of manually verifying these…

Software Engineering · Computer Science 2024-01-17 Arumoy Shome, Luis Cruz, Arie van Deursen

So Much in So Little: Creating Lightweight Embeddings of Python Libraries

In software engineering, different approaches and machine learning models leverage different types of data: source code, textual information, historical data. An important part of any project is its dependencies. The list of dependencies is…

Software Engineering · Computer Science 2022-09-09 Yaroslav Golubev, Egor Bogomolov, Egor Bulychev, Timofey Bryksin

Code4ML: a Large-scale Dataset of annotated Machine Learning Code

Program code as a data source is gaining popularity in the data science community. Possible applications for models trained on such assets range from classification for data dimensionality reduction to automatic code generation. However,…

Software Engineering · Computer Science 2022-10-31 Anastasia Drozdova, Polina Guseva, Ekaterina Trofimova, Anna Scherbakova, Andrey Ustyuzhanin

Defectors: A Large, Diverse Python Dataset for Defect Prediction

Defect prediction has been a popular research topic where machine learning (ML) and deep learning (DL) have found numerous applications. However, these ML/DL-based defect prediction models are often limited by the quality and size of their…

Software Engineering · Computer Science 2023-07-26 Parvez Mahbub, Ohiduzzaman Shuvo, Mohammad Masudur Rahman

Resolvent4py: a parallel Python package for analysis, model reduction and control of large-scale linear systems

In this paper, we present resolvent4py, a parallel Python package for the analysis, model reduction and control of large-scale linear systems with millions or billions of degrees of freedom. This package provides the user with a friendly…

Computational Physics · Physics 2026-01-13 Alberto Padovan, Vishal Anantharaman, Clarence W. Rowley, Blaine Vollmer, Tim Colonius, Daniel J. Bodony

OpenClassGen: A Large-Scale Corpus of Real-World Python Classes for LLM Research

Existing class-level code generation datasets are either synthetic (ClassEval: 100 classes) or insufficient in scale for modern training needs (RealClassEval: 400 classes), hindering robust evaluation and empirical analysis. We present…

Software Engineering · Computer Science 2026-05-01 Musfiqur Rahman, SayedHassan Khatoonabadi, Emad Shihab

Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning

Imbalanced-learn is an open-source python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced dataset frequently encountered in machine learning and pattern recognition. The implemented…

Machine Learning · Computer Science 2016-09-22 Guillaume Lemaitre, Fernando Nogueira, Christos K. Aridas

CodeInsight: A Curated Dataset of Practical Coding Solutions from Stack Overflow

We introduce a novel dataset tailored for code generation, aimed at aiding developers in common tasks. Our dataset provides examples that include a clarified intent, code snippets associated, and an average of three related unit tests. It…

Computation and Language · Computer Science 2024-09-26 Nathanaël Beau, Benoît Crabbé