Related papers: Compression-based Similarity

200 papers

We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a…

Computer Vision and Pattern Recognition · Computer Science 2007-05-23 Rudi Cilibrasi, Paul Vitanyi

We survey a new area of parameter-free similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a distance is universal up to a…

Information Retrieval · Computer Science 2007-05-23 Paul Vitanyi

Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples…

Computer Vision and Pattern Recognition · Computer Science 2009-05-21 Paul M. B. Vitanyi

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to…

Information Retrieval · Computer Science 2008-09-16 Paul M. B. Vitanyi, Frank J. Balbach, Rudi L. Cilibrasi, Ming Li

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of…

Computational Complexity · Computer Science 2011-11-09 Ming Li, Xin Chen, Xin Li, Bin Ma, Paul Vitanyi

Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of "society" is "database," and the equivalent of "use" is "way to search the…

Computation and Language · Computer Science 2007-06-13 Rudi Cilibrasi, Paul M. B. Vitanyi

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…

Computational Complexity · Computer Science 2009-10-23 Sebastiaan A. Terwijn, Leen Torenvliet, Paul M. B. Vitanyi

Identifying document similarity has many applications, e.g., source code analysis or plagiarism detection. However, identifying similarities is not trivial and can have high time complexity. For instance, the Levenshtein Distance is a common metric…

Information Retrieval · Computer Science 2023-07-24 Peter Coates, Frank Breitinger

Compression-based similarity measures are effectively employed in applications on diverse data types with a basically parameter-free approach. Nevertheless, there are problems in applying these techniques to medium-to-large datasets which…

Machine Learning · Statistics 2012-10-03 Daniele Cerra, Mihai Datcu

A set of ontology matching algorithms (for finding correspondences between concepts) is based on a thesaurus that provides the source data for the semantic distance calculations. In this wiki era, new resources may spring up and improve…

Information Retrieval · Computer Science 2009-10-12 A. A. Krizhanovsky, Feiyu Lin

While Kolmogorov complexity is the accepted absolute measure of information content in an individual finite object, a similarly absolute notion is needed for the information distance between two individual objects, for example, two…

Information Theory · Computer Science 2010-06-18 Charles H. Bennett, Peter Gacs, Ming Li, Paul M. B. Vitanyi, Wojciech H. Zurek

A new line of research uses compression methods to measure the similarity between signals. Two signals are considered similar if one can be compressed significantly when the information of the other is known. The existing compression-based…

Computer Vision and Pattern Recognition · Computer Science 2019-09-30 Tanaya Guha, Rabab K. Ward

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…

Computational Complexity · Computer Science 2010-06-17 Sebastiaan A. Terwijn, Leen Torenvliet, Paul M. B. Vitanyi

Normalized web distance (NWD) is a similarity or normalized semantic distance based on the World Wide Web or another large electronic database, for instance Wikipedia, and a search engine that returns reliable aggregate page counts. For…

Information Retrieval · Computer Science 2020-07-24 Andrew R. Cohen, Paul M. B. Vitanyi

We present a new similarity measure based on information-theoretic measures which is superior to Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that…

Machine Learning · Statistics 2014-10-22 Andrey Bogomolov, Bruno Lepri, Fabio Pianesi

The rise of the internet has resulted in an explosion of data consisting of millions of articles, images, songs, and videos. Most of this data is high-dimensional and sparse. The need to perform an efficient search for similar objects in such…

Data Structures and Algorithms · Computer Science 2016-12-20 Raghav Kulkarni, Rameshwar Pratap

Let $|A|$ denote the cardinality of a finite set $A$. For any real number $x$ define $t(x)=x$ if $x\geq 1$ and $t(x)=1$ otherwise. For any finite sets $A,B$ let $\delta(A,B)=\log_{2}(t(|B\cap\bar{A}||A|))$. We define {This appears as…

Discrete Mathematics · Computer Science 2010-10-19 Joel Ratsaby

After reviewing unnormalized and normalized information distances based on incomputable notions of Kolmogorov complexity, we discuss how Kolmogorov complexity can be approximated by data compression algorithms. We argue that optimal…

Computational Complexity · Computer Science 2007-05-23 Alexei Kaltchenko

Cilibrasi and Vitanyi have demonstrated that it is possible to extract the meaning of words from the world-wide web. To achieve this, they rely on the number of webpages that are found through a Google search containing a given word and…

Computation and Language · Computer Science 2015-01-29 Bjørn Kjos-Hanssen, Alberto J. Evangelista

Traditionally, data compression deals with the problem of concisely representing a data source, e.g. a sequence of letters, for the purpose of eventual reproduction (either exact or approximate). In this work we are interested in the case…

Information Theory · Computer Science 2013-12-10 Amir Ingber, Tsachy Weissman