Related papers: Interactive Duplicate Search in Software Documenta…
Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates a lot of near duplicate fragments, i.e. chunks of text that were copied from a single source and were later…
We describe a system that helps identify manuscripts submitted to multiple journals at the same time. Also, we discuss potential applications of the near-duplicate detection technology when run with manuscript text content, including…
Commercial web search engines employ near-duplicate detection to ensure that users see each relevant result only once, albeit the underlying web crawls typically include (near-)duplicates of many web pages. We revisit the risks and…
Nowadays, digital content is widespread and simply redistributable, either lawfully or unlawfully. For example, after images are posted on the internet, other web users can modify them and then repost their versions, thereby generating…
Code clone detection is involved with detecting duplicated fragments of code within a code base. Detecting these clones is useful for maintenance operations which require editing the clones. The tools developed are expected to be robust…
Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of…
Detecting near duplicate images is fundamental to the content ecosystem of photo sharing web applications. However, such a task is challenging when involving a web-scale image corpus containing billions of images. In this paper, we present…
Job descriptions are posted on many online channels, including company websites, job boards or social media platforms. These descriptions are usually published with varying text for the same job, due to the requirements of each platform or…
Due to the increasing volume, volatility, and diversity of data in virtually all areas of our lives, the ability to detect duplicates in potentially linked data sources is more important than ever before. However, while research is already…
All methodologies for detecting plagiarism to date have focused on the final digital "outcome", such as a document or source code. Our novel approach takes the creation process into account using logged events collected by special software…
More than ever, technical inventions are the symbol of our society's advance. Patents guarantee their creators protection against infringement. For an invention being patentable, its novelty and inventiveness have to be assessed. Therefore,…
One of the important factors that make a search engine fast and accurate is a concise and duplicate free index. In order to remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate…
The importance of an efficient and scalable document similarity detection system is undeniable nowadays. Search engines need batch text similarity measures to detect duplicated and near-duplicated web pages in their indexes in order to…
We present a new method to detect duplicates used to merge different bibliographic record corpora with the help of lexical and social information. As we show, a trivial key is not available to delete useless documents. Merging heteregeneous…
Machine Learning software documentation is different from most of the documentations that were studied in software engineering research. Often, the users of these documentations are not software experts. The increasing interest in using…
In the task of automatic program synthesis, one obtains pairs of matching inputs and outputs and generates a computer program, in a particular domain-specific language (DSL), which given each sample input returns the matching output. A key…
Automatic text summarisation has drawn considerable interest in the area of software engineering. It is challenging to summarise the activities related to a software project, (1) because of the volume and heterogeneity of involved software…
Online learning platforms provide diverse questions to gauge the learners' understanding of different concepts. The repository of questions has to be constantly updated to ensure a diverse pool of questions to conduct assessments for…
There is a general belief that software must be able to easily do things that humans find difficult. Since finding sources for plagiarism in a text is not an easy task, there is a wide-spread expectation that it must be simple for software…
About 40% of software bug reports are duplicates of one another, which pose a major overhead during software maintenance. Traditional techniques often focus on detecting duplicate bug reports that are textually similar. However, in bug…