Related papers: Authorship Analysis based on Data Compression
Compression-based similarity measures are effectively employed in applications on diverse data types with a basically parameter-free approach. Nevertheless, there are problems in applying these techniques to medium-to-large datasets which…
Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can be time complex. For instance, the Levenshtein Distance is a common metric…
Patent similarity analysis plays a crucial role in evaluating the risk of patent infringement. Nonetheless, this analysis is predominantly conducted manually by legal experts, often resulting in a time-consuming process. Recent advances in…
We adapt the Higher Criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other…
First we consider pair-wise distances for literal objects consisting of finite binary files. These files are taken to contain all of their meaning, like genomes or books. The distances are based on compression of the objects concerned,…
Compression models represent an interesting approach for different classification tasks and have been used widely across many research fields. We adapt compression models to the field of authorship verification (AV), a branch of digital…
We propose a computationally light method for estimating similarities between text documents, which we call the density similarity (DS) method. The method is based on a word embedding in a high-dimensional Euclidean space and on kernel…
The inclusion of semantic information in any similarity measures improves the efficiency of the similarity measure and provides human interpretable results for further analysis. The similarity calculation method that focuses on features…
This paper proposes a novel method that can replace compression-based dissimilarity measure (CDM) in composer estimation task. The main features of the proposed method are clarity and scalability. First, since the proposed method is…
Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Most of this data is high dimensional and sparse, for e.g., the bag-of-words representation used for…
Well-established automatic analyses of texts mainly consider frequencies of linguistic units, e.g. letters, words and bigrams, while methods based on co-occurrence networks consider the structure of texts regardless of the nodes label (i.e.…
We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a…
In this paper, we propose a way to improve the compression based dissimilarity measure, CDM. We propose to use a modified value of the file size, where the original CDM uses an unmodified file size. Our application is a music score…
A new line of research uses compression methods to measure the similarity between signals. Two signals are considered similar if one can be compressed significantly when the information of the other is known. The existing compression-based…
This study enhances Jiang et al.'s compression-based classification algorithm by addressing its limitations in detecting semantic similarities between text documents. The proposed improvements focus on unigram extraction and optimized…
This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the…
The huge amount of information stored in text form makes methods that deal with texts really interesting. This thesis focuses on dealing with texts using compression distances. More specifically, the thesis takes a small step towards…
This paper presents a novel research problem on joint discovery of commonalities and differences between two individual documents (or document sets), called Comparative Document Analysis (CDA). Given any pair of documents from a document…
If two probability density functions (PDFs) have values for their first $n$ moments which are quite close to each other (upper bounds of their differences are known), can it be expected that the PDFs themselves are very similar? Shown below…
We present a supervised learning algorithm for text categorization which has brought the team of authors the 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and a 3rd prize…