Related papers: Compression-based Similarity

We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a…

Computer Vision and Pattern Recognition · Computer Science 2007-05-23 Rudi Cilibrasi , Paul Vitanyi

Universal Similarity

We survey a new area of parameter-free similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a…

Information Retrieval · Computer Science 2007-05-23 Paul Vitanyi

Information Distance in Multiples

Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples…

Computer Vision and Pattern Recognition · Computer Science 2009-05-21 Paul M. B. Vitanyi

Normalized Information Distance

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to…

Information Retrieval · Computer Science 2008-09-16 Paul M. B. Vitanyi , Frank J. Balbach , Rudi L. Cilibrasi , Ming Li

The similarity metric

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new ``normalized information distance'', based on the noncomputable notion of…

Computational Complexity · Computer Science 2011-11-09 Ming Li , Xin Chen , Xin Li , Bin Ma , Paul Vitanyi

The Google Similarity Distance

Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of `society' is `database,' and the equivalent of `use' is `way to search the…

Computation and Language · Computer Science 2007-06-13 Rudi Cilibrasi , Paul M. B. Vitanyi

Nonapproximablity of the Normalized Information Distance

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…

Computational Complexity · Computer Science 2009-10-23 Sebastiaan A. Terwijn , Leen Torenvliet , Paul M. B. Vitanyi

Identifying document similarity using a fast estimation of the Levenshtein Distance based on compression and signatures

Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can be time complex. For instance, the Levenshtein Distance is a common metric…

Information Retrieval · Computer Science 2023-07-24 Peter Coates , Frank Breitinger

A fast compression-based similarity measure with applications to content-based image retrieval

Compression-based similarity measures are effectively employed in applications on diverse data types with a basically parameter-free approach. Nevertheless, there are problems in applying these techniques to medium-to-large datasets which…

Machine Learning · Statistics 2012-10-03 Daniele Cerra , Mihai Datcu

A set of ontology matching algorithms (for finding correspondences between concepts) is based on a thesaurus that provides the source data for the semantic distance calculations. In this wiki era, new resources may spring up and improve…

Information Retrieval · Computer Science 2009-10-12 A. A. Krizhanovsky , Feiyu Lin

Information Distance

While Kolmogorov complexity is the accepted absolute measure of information content in an individual finite object, a similarly absolute notion is needed for the information distance between two individual objects, for example, two…

Information Theory · Computer Science 2010-06-18 Charles H. Bennett , Peter Gacs , Ming Li , Paul M. B. Vitanyi , Wojciech H. Zurek

A new line of research uses compression methods to measure the similarity between signals. Two signals are considered similar if one can be compressed significantly when the information of the other is known. The existing compression-based…

Computer Vision and Pattern Recognition · Computer Science 2019-09-30 Tanaya Guha , Rabab K. Ward

Normalized Information Distance is Not Semicomputable

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…

Computational Complexity · Computer Science 2010-06-17 Sebastiaan A. Terwijn , Leen Torenvliet , Paul M. B. Vitanyi

Normalized web distance (NWD) is a similarity or normalized semantic distance based on the World Wide Web or another large electronic database, for instance Wikipedia, and a search engine that returns reliable aggregate page counts. For…

Information Retrieval · Computer Science 2020-07-24 Andrew R. Cohen , Paul M. B. Vitanyi

Generalized Compression Dictionary Distance as Universal Similarity Measure

We present a new similarity measure based on information theoretic measures which is superior than Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that…

Machine Learning · Statistics 2014-10-22 Andrey Bogomolov , Bruno Lepri , Fabio Pianesi

The rise of internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high dimensional and sparse. The need to perform an efficient search for similar objects in such…

Data Structures and Algorithms · Computer Science 2016-12-20 Raghav Kulkarni , Rameshwar Pratap

Combinatorial information distance

Let $|A|$ denote the cardinality of a finite set $A$. For any real number $x$ define $t(x)=x$ if $x\geq1$ and 1 otherwise. For any finite sets $A,B$ let $\delta(A,B)$ $=$ $\log_{2}(t(|B\cap\bar{A}||A|))$. We define {This appears as…

Discrete Mathematics · Computer Science 2010-10-19 Joel Ratsaby

Algorithms for Estimating Information Distance with Application to Bioinformatics and Linguistics

After reviewing unnormalized and normalized information distances based on incomputable notions of Kolmogorov complexity, we discuss how Kolmogorov complexity can be approximated by data compression algorithms. We argue that optimal…

Computational Complexity · Computer Science 2007-05-23 Alexei Kaltchenko

Google distance between words

Cilibrasi and Vitanyi have demonstrated that it is possible to extract the meaning of words from the world-wide web. To achieve this, they rely on the number of webpages that are found through a Google search containing a given word and…

Computation and Language · Computer Science 2015-01-29 Bjørn Kjos-Hanssen , Alberto J. Evangelista

The Minimal Compression Rate for Similarity Identification

Traditionally, data compression deals with the problem of concisely representing a data source, e.g. a sequence of letters, for the purpose of eventual reproduction (either exact or approximate). In this work we are interested in the case…

Information Theory · Computer Science 2013-12-10 Amir Ingber , Tsachy Weissman