Related papers: Authorship Analysis based on Data Compression

A fast compression-based similarity measure with applications to content-based image retrieval

Compression-based similarity measures are effectively employed in applications on diverse data types with a basically parameter-free approach. Nevertheless, there are problems in applying these techniques to medium-to-large datasets which…

Machine Learning · Statistics 2012-10-03 Daniele Cerra, Mihai Datcu

Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures

Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can be time complex. For instance, the Levenshtein Distance is a common metric…

Information Retrieval · Computer Science 2023-07-24 Peter Coates, Frank Breitinger

A Novel Patent Similarity Measurement Methodology: Semantic Distance and Technological Distance

Patent similarity analysis plays a crucial role in evaluating the risk of patent infringement. Nonetheless, this analysis is predominantly conducted manually by legal experts, often resulting in a time-consuming process. Recent advances in…

Information Retrieval · Computer Science 2023-12-04 Yongmin Yoo, Cheonkam Jeong, Sanguk Gim, Junwon Lee, Zachary Schimke, Deaho Seo

Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship

We adapt the Higher Criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other…

Computation and Language · Computer Science 2023-10-03 Alon Kipnis

Compression-based Similarity

First we consider pair-wise distances for literal objects consisting of finite binary files. These files are taken to contain all of their meaning, like genomes or books. The distances are based on compression of the objects concerned,…

Information Theory · Computer Science 2011-10-21 Paul M. B. Vitanyi

Authorship Verification based on Compression-Models

Compression models represent an interesting approach for different classification tasks and have been used widely across many research fields. We adapt compression models to the field of authorship verification (AV), a branch of digital…

Information Retrieval · Computer Science 2017-06-05 Oren Halvani, Christian Winter, Lukas Graner

We propose a computationally light method for estimating similarities between text documents, which we call the density similarity (DS) method. The method is based on a word embedding in a high-dimensional Euclidean space and on kernel…

Computation and Language · Computer Science 2020-09-03 Ilia Rushkin

A Comparison of Semantic Similarity Methods for Maximum Human Interpretability

The inclusion of semantic information in any similarity measure improves the efficiency of the similarity measure and provides human interpretable results for further analysis. The similarity calculation method that focuses on features…

Information Retrieval · Computer Science 2019-11-01 Pinky Sitikhu, Kritish Pahi, Pujan Thapa, Subarna Shakya

Computing Information Quantity as Similarity Measure for Music Classification Task

This paper proposes a novel method that can replace compression-based dissimilarity measure (CDM) in composer estimation task. The main features of the proposed method are clarity and scalability. First, since the proposed method is…

Sound · Computer Science 2018-04-17 Ayaka Takamoto, Mitsuo Yoshida, Kyoji Umemura, Yuko Ichikawa

Efficient Compression Technique for Sparse Sets

Recent technological advancements have led to the generation of huge amounts of data over the web, such as text, image, audio and video. Most of this data is high dimensional and sparse, e.g., the bag-of-words representation used for…

Information Theory · Computer Science 2017-08-17 Rameshwar Pratap, Ishan Sohony, Raghav Kulkarni

On the role of words in the network structure of texts: application to authorship attribution

Well-established automatic analyses of texts mainly consider frequencies of linguistic units, e.g. letters, words and bigrams, while methods based on co-occurrence networks consider the structure of texts regardless of the nodes label (i.e.…

Computation and Language · Computer Science 2018-02-27 Camilo Akimushkin, Diego R. Amancio, Osvaldo N. Oliveira

We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a…

Computer Vision and Pattern Recognition · Computer Science 2007-05-23 Rudi Cilibrasi, Paul Vitanyi

Improving Compression Based Dissimilarity Measure for Music Score Analysis

In this paper, we propose a way to improve the compression based dissimilarity measure, CDM. We propose to use a modified value of the file size, where the original CDM uses an unmodified file size. Our application is a music score…

Sound · Computer Science 2017-10-05 Ayaka Takamoto, Mayu Umemura, Mitsuo Yoshida, Kyoji Umemura

A new line of research uses compression methods to measure the similarity between signals. Two signals are considered similar if one can be compressed significantly when the information of the other is known. The existing compression-based…

Computer Vision and Pattern Recognition · Computer Science 2019-09-30 Tanaya Guha, Rabab K. Ward

An Enhancement of Jiang, Z., et al.'s Compression-Based Classification Algorithm Applied to News Article Categorization

This study enhances Jiang et al.'s compression-based classification algorithm by addressing its limitations in detecting semantic similarities between text documents. The proposed improvements focus on unigram extraction and optimized…

Computation and Language · Computer Science 2025-02-21 Sean Lester C. Benavides, Cid Antonio F. Masapol, Jonathan C. Morano, Dan Michael A. Cortez

This paper presents a new approach for measuring semantic similarity/distance between words and concepts. It combines a lexical taxonomy structure with corpus statistical information so that the semantic distance between nodes in the…

cmp-lg · Computer Science 2008-02-03 Jay J. Jiang, David W. Conrath

Analysis and study on text representation to improve the accuracy of the Normalized Compression Distance

The huge amount of information stored in text form makes methods that deal with texts really interesting. This thesis focuses on dealing with texts using compression distances. More specifically, the thesis takes a small step towards…

Information Theory · Computer Science 2012-05-30 Ana Granados

Comparative Document Analysis for Large Text Corpora

This paper presents a novel research problem on joint discovery of commonalities and differences between two individual documents (or document sets), called Comparative Document Analysis (CDA). Given any pair of documents from a document…

Information Retrieval · Computer Science 2015-10-27 Xiang Ren, Yuanhua Lv, Kuansan Wang, Jiawei Han

Bounded Statistics

If two probability density functions (PDFs) have values for their first $n$ moments which are quite close to each other (upper bounds of their differences are known), can it be expected that the PDFs themselves are very similar? Shown below…

Statistics Theory · Mathematics 2018-08-16 Pranava Chaitanya Jayanti, Konstantina Trivisa

Text Categorization via Similarity Search: An Efficient and Effective Novel Algorithm

We present a supervised learning algorithm for text categorization which has brought the team of authors the 2nd place in the text categorization division of the 2012 Cybersecurity Data Mining Competition (CDMC'2012) and a 3rd prize…

Information Retrieval · Computer Science 2013-07-11 Hubert Haoyang Duan, Vladimir Pestov, Varun Singla