Related papers: Interactive Duplicate Search in Software Documenta…

Detecting Near Duplicates in Software Documentation

Contemporary software documentation is as complicated as the software itself. During its lifecycle, the documentation accumulates a lot of near duplicate fragments, i.e. chunks of text that were copied from a single source and were later…

Software Engineering · Computer Science 2018-10-10 D. V. Luciv, D. V. Koznov, G. A. Chernishev, A. N. Terekhov

How near-duplicate detection improves editors' and authors' publishing experience

We describe a system that helps identify manuscripts submitted to multiple journals at the same time. Also, we discuss potential applications of the near-duplicate detection technology when run with manuscript text content, including…

Digital Libraries · Computer Science 2021-08-12 Yury Kashnitsky, Vaishnavi Kandala, Egbert van Wezenbeek, IJsbrand Jan Aalbersberg, Catriona Fennell, Georgios Tsatsaronis

The Impact of Main Content Extraction on Near-Duplicate Detection

Commercial web search engines employ near-duplicate detection to ensure that users see each relevant result only once, albeit the underlying web crawls typically include (near-)duplicates of many web pages. We revisit the risks and…

Information Retrieval · Computer Science 2021-11-23 Maik Fröbe, Matthias Hagen, Janek Bevendorff, Michael Völske, Benno Stein, Christopher Schröder, Robby Wagner, Lukas Gienapp, Martin Potthast

A Review on Near Duplicate Detection of Images using Computer Vision Techniques

Nowadays, digital content is widespread and simply redistributable, either lawfully or unlawfully. For example, after images are posted on the internet, other web users can modify them and then repost their versions, thereby generating…

Computer Vision and Pattern Recognition · Computer Science 2020-09-08 K. K. Thyagharajan, G. Kalaiarasi

Detecting Code Clones: A review

Code clone detection is involved with detecting duplicated fragments of code within a code base. Detecting these clones is useful for maintenance operations which require editing the clones. The tools developed are expected to be robust…

Software Engineering · Computer Science 2016-05-10 Ogechi Onuoha

On sampling from data with duplicate records

Data deduplication is the task of detecting records in a database that correspond to the same real-world entity. Our goal is to develop a procedure that samples uniformly from the set of entities present in the database in the presence of…

Machine Learning · Computer Science 2020-08-25 Alireza Heidari, Shrinu Kushagra, Ihab F. Ilyas

Evolution of a Web-Scale Near Duplicate Image Detection System

Detecting near duplicate images is fundamental to the content ecosystem of photo sharing web applications. However, such a task is challenging when involving a web-scale image corpus containing billions of images. In this paper, we present…

Computer Vision and Pattern Recognition · Computer Science 2022-09-20 Andrey Gusev, Jiajing Xu

Combining Embeddings and Domain Knowledge for Job Posting Duplicate Detection

Job descriptions are posted on many online channels, including company websites, job boards or social media platforms. These descriptions are usually published with varying text for the same job, due to the requirements of each platform or…

Computation and Language · Computer Science 2024-06-11 Matthias Engelbach, Dennis Klau, Maximilien Kintz, Alexander Ulrich

Towards Scalable Generation of Realistic Test Data for Duplicate Detection

Due to the increasing volume, volatility, and diversity of data in virtually all areas of our lives, the ability to detect duplicates in potentially linked data sources is more important than ever before. However, while research is already…

Databases · Computer Science 2024-01-01 Fabian Panse, Wolfram Wingerath, Benjamin Wollmer

Detecting Plagiarism based on the Creation Process

All methodologies for detecting plagiarism to date have focused on the final digital "outcome", such as a document or source code. Our novel approach takes the creation process into account using logged events collected by special software…

Other Computer Science · Computer Science 2017-07-21 Johannes Schneider, Avi Bernstein, Jan Vom Brocke, Kostadin Damevski, David C. Shepherd

Automating the search for a patent's prior art with a full text similarity search

More than ever, technical inventions are the symbol of our society's advance. Patents guarantee their creators protection against infringement. For an invention being patentable, its novelty and inventiveness have to be assessed. Therefore,…

Information Retrieval · Computer Science 2019-03-06 Lea Helmers, Franziska Horn, Franziska Biegler, Tim Oppermann, Klaus-Robert Müller

One of the important factors that make a search engine fast and accurate is a concise and duplicate free index. In order to remove duplicate and near-duplicate documents from the index, a search engine needs a swift and reliable duplicate…

Information Retrieval · Computer Science 2019-09-26 Hamid Mohammadi, Seyed Hossein Khasteh

Multi-reference Cosine: A New Approach to Text Similarity Measurement in Large Collections

The importance of an efficient and scalable document similarity detection system is undeniable nowadays. Search engines need batch text similarity measures to detect duplicated and near-duplicated web pages in their indexes in order to…

Information Retrieval · Computer Science 2018-10-09 Hamid Mohammadi, Amin Nikoukaran

Duplicate Detection with Efficient Language Models for Automatic Bibliographic Heterogeneous Data Integration

We present a new method to detect duplicates used to merge different bibliographic record corpora with the help of lexical and social information. As we show, a trivial key is not available to delete useless documents. Merging heterogeneous…

Databases · Computer Science 2015-04-29 Nicolas Turenne

Documentation of Machine Learning Software

Machine Learning software documentation is different from most of the documentations that were studied in software engineering research. Often, the users of these documentations are not software experts. The increasing interest in using…

Software Engineering · Computer Science 2020-02-03 Yalda Hashemi, Maleknaz Nayebi, Giuliano Antoniol

Improved Tree Search for Automatic Program Synthesis

In the task of automatic program synthesis, one obtains pairs of matching inputs and outputs and generates a computer program, in a particular domain-specific language (DSL), which given each sample input returns the matching output. A key…

Machine Learning · Computer Science 2023-03-14 Aran Carmon, Lior Wolf

Human-Like Summaries from Heterogeneous and Time-Windowed Software Development Artefacts

Automatic text summarisation has drawn considerable interest in the area of software engineering. It is challenging to summarise the activities related to a software project, (1) because of the volume and heterogeneity of involved software…

Software Engineering · Computer Science 2020-04-30 Mahfouth Alghamdi, Christoph Treude, Markus Wagner

Unsupervised Question Duplicate and Related Questions Detection in e-learning platforms

Online learning platforms provide diverse questions to gauge the learners' understanding of different concepts. The repository of questions has to be constantly updated to ensure a diverse pool of questions to conduct assessments for…

Computation and Language · Computer Science 2023-01-13 Maksimjeet Chowdhary, Sanyam Goyal, Venktesh V, Mukesh Mohania, Vikram Goyal

Testing of Support Tools for Plagiarism Detection

There is a general belief that software must be able to easily do things that humans find difficult. Since finding sources for plagiarism in a text is not an easy task, there is a wide-spread expectation that it must be simple for software…

Digital Libraries · Computer Science 2023-06-21 Tomáš Foltýnek, Dita Dlabolová, Alla Anohina-Naumeca, Salim Razı, Július Kravjar, Laima Kamzola, Jean Guerrero-Dib, Özgür Çelik, Debora Weber-Wulff

Towards Understanding the Impacts of Textual Dissimilarity on Duplicate Bug Report Detection

About 40% of software bug reports are duplicates of one another, which pose a major overhead during software maintenance. Traditional techniques often focus on detecting duplicate bug reports that are textually similar. However, in bug…

Software Engineering · Computer Science 2022-12-21 Sigma Jahan, Mohammad Masudur Rahman