Related papers: TestAug: A Framework for Augmenting Capability-bas…

200 papers

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on…

Computation and Language · Computer Science 2020-05-11 Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

The synergy between deep learning models and traditional automation tools, such as built-in tactics of the proof assistant and off-the-shelf automated theorem provers, plays a crucial role in developing robust and efficient neural theorem…

Machine Learning · Computer Science 2025-06-09 Haoxiong Liu, Jiacheng Sun, Zhenguo Li, Andrew C Yao

Behavioral testing in NLP allows fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is…

Computation and Language · Computer Science 2023-11-06 Javier Ferrando, Matthias Sperber, Hendra Setiawan, Dominic Telaar, Saša Hasan

In recent years, the application of behavioral testing in Natural Language Processing (NLP) model evaluation has experienced a remarkable and substantial growth. However, the existing methods continue to be restricted by the requirements…

Software Engineering · Computer Science 2025-03-10 Hengrui Xing, Cong Tian, Liang Zhao, Zhi Ma, WenSheng Wang, Nan Zhang, Chao Huang, Zhenhua Duan

Software testing remains critical for ensuring reliability, yet traditional approaches are slow, costly, and prone to gaps in coverage. This paper presents an AI-driven framework that automates test case generation and validation using…

Software Engineering · Computer Science 2025-08-25 Saba Naqvi, Mohammad Baqar

Despite the success of ChatGPT, its performances on most NLP tasks are still well below the supervised baselines. In this work, we looked into the causes, and discovered that its subpar performance was caused by the following factors: (1)…

Computation and Language · Computer Science 2023-10-10 Xiaofei Sun, Linfeng Dong, Xiaoya Li, Zhen Wan, Shuhe Wang, Tianwei Zhang, Jiwei Li, Fei Cheng, Lingjuan Lyu, Fei Wu, Guoyin Wang

Competitive programming contests play a crucial role in cultivating computational thinking and algorithmic skills among learners. However, generating comprehensive test cases to effectively assess programming solutions remains…

Software Engineering · Computer Science 2025-09-30 Stefan Dascalescu, Adrian Marius Dumitran, Mihai Alexandru Vasiluta

Human annotation cost and time remain significant bottlenecks in Natural Language Processing (NLP), with test data annotation being particularly expensive due to the stringent requirement for low-error and high-quality labels necessary for…

Computation and Language · Computer Science 2026-03-24 Antonio Purificato, Maria Sofia Bucarelli, Andrea Bacciu, Amin Mantrach, Fabrizio Silvestri

Recent work in behavioral testing for natural language processing (NLP) models, such as Checklist, is inspired by related paradigms in software engineering testing. They allow evaluation of general linguistic capabilities and domain…

Computation and Language · Computer Science 2024-08-09 Ying Li, Rahul Singh, Tarun Joshi, Agus Sudjianto

Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly…

Testing is essential to modern software engineering for building reliable software. Given the high costs of manually creating test cases, automated test case generation, particularly methods utilizing large language models, has become…

Software Engineering · Computer Science 2025-06-30 Yifeng He, Jicheng Wang, Yuyang Rong, Hao Chen

Unit testing is essential in detecting bugs in functionally-discrete program units. Manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques can generate tests with reasonable coverage, they…

Software Engineering · Computer Science 2024-05-21 Zhiqiang Yuan, Yiling Lou, Mingwei Liu, Shiji Ding, Kaixin Wang, Yixuan Chen, Xin Peng

Sensitivity of deep-neural models to input noise is known to be a challenging problem. In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors. To mitigate this issue, models may leverage…

Computation and Language · Computer Science 2021-11-18 Jakub Náplava, Martin Popel, Milan Straka, Jana Straková

Acceptance testing is a validation activity performed to ensure the conformance of software systems with respect to their functional requirements. In safety critical systems, it plays a crucial role since it is enforced by software…

Software Engineering · Computer Science 2020-05-19 Chunhui Wang, Fabrizio Pastore, Arda Goknil, Lionel C. Briand

The development of modern NLP applications often relies on various benchmark datasets containing plenty of manually labeled tests to evaluate performance. While constructing datasets often costs many resources, the performance on the…

Software Engineering · Computer Science 2023-08-01 Pin Ji, Yang Feng, Weitao Huang, Jia Liu, Zhihong Zhao

ChatGPT has achieved great success and can be considered to have acquired an infrastructural status. There are abundant works for evaluating ChatGPT on benchmarks. However, existing benchmarks encounter two challenges: (1) Disregard for…

Computation and Language · Computer Science 2024-06-19 Shangqing Tu, Chunyang Li, Jifan Yu, Xiaozhi Wang, Lei Hou, Juanzi Li

The prevalence of software systems has become an integral part of modern-day living. Software usage has increased significantly, leading to its growth in both size and complexity. Consequently, software development is becoming a more…

Software Engineering · Computer Science 2023-06-07 Tiago Dias, Arthur Batista, Eva Maia, Isabel Praça

Generating counterfactual test-cases is an important backbone for testing NLP models and making them as robust and reliable as traditional software. In generating the test-cases, a desired property is the ability to control the test-case…

Computation and Language · Computer Science 2022-06-22 Nishtha Madaan, Srikanta Bedathur, Diptikalyan Saha

Iterative evaluation of LLMs during training is essential to ensure expected capability development, but can be time- and compute-intensive. While NLU tasks, where the model selects from fixed answer choices, are cheap to evaluate,…

Computation and Language · Computer Science 2025-09-17 Viktor Hangya, Fabian Küch, Darina Gold

NLP-powered automatic question generation (QG) techniques carry great pedagogical potential of saving educators' time and benefiting student learning. Yet, QG systems have not been widely adopted in classrooms to date. In this work, we aim…

Human-Computer Interaction · Computer Science 2022-05-03 Xu Wang, Simin Fan, Jessica Houghton, Lu Wang