Stop Words for Processing Software Engineering Documents: Do they Matter?

Yaohou Fan; Chetan Arora; Christoph Treude

Software Engineering; Computation and Language | 2023-06-13 (v2)

Abstract

Stop words, which are considered non-predictive, are often eliminated in natural language processing tasks. However, the definition of uninformative vocabulary is vague, so most algorithms use general knowledge-based stop lists to remove stop words. There is an ongoing debate among academics about the usefulness of stop word elimination, especially in domain-specific settings. In this work, we investigate the usefulness of stop word removal in a software engineering context. To do this, we replicate and experiment with three software engineering research tools from related work. Additionally, we construct a corpus of software engineering domain-related text from 10,000 Stack Overflow questions and identify 200 domain-specific stop words using traditional information-theoretic methods. Our results show that using domain-specific stop words significantly improved the performance of the research tools compared to using a general stop list, with 17 out of 19 evaluation measures showing better performance. Online appendix: https://zenodo.org/record/7865748
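
The abstract does not say which information-theoretic measures were used, so the following is only an illustrative sketch of one common approach, not the authors' method. It assumes that a term spread evenly across many documents (high entropy of its occurrence distribution) carries little discriminating information and is therefore a stop-word candidate. The function names, the tokenizer, and the tiny sample corpus are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Simplifying assumption: lowercase, alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def term_entropy(docs):
    """Map each term to the entropy (in bits) of its distribution over docs."""
    per_doc = [Counter(tokenize(d)) for d in docs]
    totals = Counter()
    for counts in per_doc:
        totals.update(counts)

    entropy = {}
    for term, total in totals.items():
        h = 0.0
        for counts in per_doc:
            c = counts.get(term, 0)
            if c:
                p = c / total  # share of the term's occurrences in this doc
                h -= p * math.log2(p)
        entropy[term] = h
    return entropy


def candidate_stop_words(docs, k):
    """Return the k highest-entropy terms (ties broken alphabetically)."""
    scores = term_entropy(docs)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


# Hypothetical mini-corpus standing in for Stack Overflow questions.
docs = [
    "How do I fix this error in my code",
    "Why does my code throw a null pointer error",
    "Best way to parse JSON in my code",
]
print(candidate_stop_words(docs, 2))  # → ['code', 'my']
```

On this toy corpus, "code" and "my" appear once in every document and so rank highest, which matches the intuition that a word like "code" is uninformative in a software engineering corpus even though it is absent from general stop lists.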

Keywords

program analysis; software engineering; natural language processing

Cite

@article{arxiv.2303.10439,
  title  = {Stop Words for Processing Software Engineering Documents: Do they Matter?},
  author = {Yaohou Fan and Chetan Arora and Christoph Treude},
  journal= {arXiv preprint arXiv:2303.10439},
  year   = {2023}
}

Comments

Accepted for publication at the 2nd Intl. Workshop on NL-based Software Engineering (NLBSE 2023)
