Related papers: TestAug: A Framework for Augmenting Capability-bas…

Beyond Accuracy: Behavioral Testing of NLP models with CheckList

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on…

Computation and Language · Computer Science 2020-05-11 Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

ProofAug: Efficient Neural Theorem Proving via Fine-grained Proof Structure Analysis

The synergy between deep learning models and traditional automation tools, such as built-in tactics of the proof assistant and off-the-shelf automated theorem provers, plays a crucial role in developing robust and efficient neural theorem…

Machine Learning · Computer Science 2025-06-09 Haoxiong Liu, Jiacheng Sun, Zhenguo Li, Andrew C Yao

Automating Behavioral Testing in Machine Translation

Behavioral testing in NLP allows fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is…

Computation and Language · Computer Science 2023-11-06 Javier Ferrando, Matthias Sperber, Hendra Setiawan, Dominic Telaar, Saša Hasan

AutoTestForge: A Multidimensional Automated Testing Framework for Natural Language Processing Models

In recent years, the application of behavioral testing in Natural Language Processing (NLP) model evaluation has experienced a remarkable and substantial growth. However, the existing methods continue to be restricted by the requirements…

Software Engineering · Computer Science 2025-03-10 Hengrui Xing, Cong Tian, Liang Zhao, Zhi Ma, WenSheng Wang, Nan Zhang, Chao Huang, Zhenhua Duan

Breaking Barriers in Software Testing: The Power of AI-Driven Automation

Software testing remains critical for ensuring reliability, yet traditional approaches are slow, costly, and prone to gaps in coverage. This paper presents an AI-driven framework that automates test case generation and validation using…

Software Engineering · Computer Science 2025-08-25 Saba Naqvi, Mohammad Baqar

Pushing the Limits of ChatGPT on NLP Tasks

Despite the success of ChatGPT, its performances on most NLP tasks are still well below the supervised baselines. In this work, we looked into the causes, and discovered that its subpar performance was caused by the following factors: (1)…

Computation and Language · Computer Science 2023-10-10 Xiaofei Sun, Linfeng Dong, Xiaoya Li, Zhen Wan, Shuhe Wang, Tianwei Zhang, Jiwei Li, Fei Cheng, Lingjuan Lyu, Fei Wu, Guoyin Wang

Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests

Competitive programming contests play a crucial role in cultivating computational thinking and algorithmic skills among learners. However, generating comprehensive test cases to effectively assess programming solutions remains…

Software Engineering · Computer Science 2025-09-30 Stefan Dascalescu, Adrian Marius Dumitran, Mihai Alexandru Vasiluta

Select, Label, Evaluate: Active Testing in NLP

Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for…

Computation and Language · Computer Science 2026-03-24 Antonio Purificato, Maria Sofia Bucarelli, Andrea Bacciu, Amin Mantrach, Fabrizio Silvestri

Automatic Generation of Behavioral Test Cases For Natural Language Processing Using Clustering and Prompting

Recent work in behavioral testing for natural language processing (NLP) models, such as Checklist, is inspired by related paradigms in software engineering testing. They allow evaluation of general linguistic capabilities and domain…

Computation and Language · Computer Science 2024-08-09 Ying Li, Rahul Singh, Tarun Joshi, Agus Sudjianto

Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly…

Computation and Language · Computer Science 2021-06-18 Simon Mille, Kaustubh D. Dhole, Saad Mahamood, Laura Perez-Beltrachini, Varun Gangal, Mihir Kale, Emiel van Miltenburg, Sebastian Gehrmann

FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation

Testing is essential to modern software engineering for building reliable software. Given the high costs of manually creating test cases, automated test case generation, particularly methods utilizing large language models, has become…

Software Engineering · Computer Science 2025-06-30 Yifeng He, Jicheng Wang, Yuyang Rong, Hao Chen

No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation

Unit testing is essential in detecting bugs in functionally-discrete program units. Manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques can generate tests with reasonable coverage, they…

Software Engineering · Computer Science 2024-05-21 Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng

Understanding Model Robustness to User-generated Noisy Texts

Sensitivity of deep-neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage…

Computation and Language · Computer Science 2021-11-18 Jakub Náplava, Martin Popel, Milan Straka, Jana Straková

Automatic Generation of Acceptance Test Cases from Use Case Specifications: an NLP-based Approach

Acceptance testing is a validation activity performed to ensure the conformance of software systems with respect to their functional requirements. In safety critical systems, it plays a crucial role since it is enforced by software…

Software Engineering · Computer Science 2020-05-19 Chunhui Wang, Fabrizio Pastore, Arda Goknil, Lionel C. Briand

Intergenerational Test Generation for Natural Language Processing Applications

The development of modern NLP applications often relies on various benchmark datasets containing plenty of manually labeled tests to evaluate performance. While constructing datasets often costs many resources, the performance on the…

Software Engineering · Computer Science 2023-08-01 Pin Ji, Yang Feng, Weitao Huang, Jia Liu, Zhihong Zhao

ChatLog: Carefully Evaluating the Evolution of ChatGPT Across Time

ChatGPT has achieved great success and can be considered to have acquired an infrastructural status. There are abundant works for evaluating ChatGPT on benchmarks. However, existing benchmarks encounter two challenges: (1) Disregard for…

Computation and Language · Computer Science 2024-06-19 Shangqing Tu, Chunyang Li, Jifan Yu, Xiaozhi Wang, Lei Hou, Juanzi Li

TestLab: An Intelligent Automated Software Testing Framework

The prevalence of software systems has become an integral part of modern-day living. Software usage has increased significantly, leading to its growth in both size and complexity. Consequently, software development is becoming a more…

Software Engineering · Computer Science 2023-06-07 Tiago Dias, Arthur Batista, Eva Maia, Isabel Praça

Plug and Play Counterfactual Text Generation for Model Robustness

Generating counterfactual test-cases is an important backbone for testing NLP models and making them as robust and reliable as traditional software. In generating the test-cases, a desired property is the ability to control the test-case…

Computation and Language · Computer Science 2022-06-22 Nishtha Madaan, Srikanta Bedathur, Diptikalyan Saha

From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models

Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate,…

Computation and Language · Computer Science 2025-09-17 Viktor Hangya, Fabian Küch, Darina Gold

Towards Process-Oriented, Modular, and Versatile Question Generation that Meets Educational Needs

NLP-powered automatic question generation (QG) techniques carry great pedagogical potential of saving educators' time and benefiting student learning. Yet, QG systems have not been widely adopted in classrooms to date. In this work, we aim…

Human-Computer Interaction · Computer Science 2022-05-03 Xu Wang, Simin Fan, Jessica Houghton, Lu Wang