Related papers: Normalized Information Distance

Nonapproximability of the Normalized Information Distance

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…

Computational Complexity · Computer Science 2009-10-23 Sebastiaan A. Terwijn, Leen Torenvliet, Paul M. B. Vitanyi

Normalized Information Distance is Not Semicomputable

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…

Computational Complexity · Computer Science 2010-06-17 Sebastiaan A. Terwijn, Leen Torenvliet, Paul M. B. Vitanyi

Information Distance in Multiples

Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering, and classification. The notion of information distance is extended from pairs to multiples…

Computer Vision and Pattern Recognition · Computer Science 2009-05-21 Paul M. B. Vitanyi

Generalized Compression Dictionary Distance as Universal Similarity Measure

We present a new similarity measure based on information-theoretic measures that is superior to Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that…

Machine Learning · Statistics 2014-10-22 Andrey Bogomolov, Bruno Lepri, Fabio Pianesi

Perceptually Inspired Normalized Conditional Compression Distance

Image similarity measurement is a common issue in a broad range of applications in image processing, recognition, classification and retrieval. Conventional image similarity measures are often limited to specific applications and cannot be…

Image and Video Processing · Electrical Eng. & Systems 2019-05-09 Nima Nikvand, Zhou Wang, Xavier Fernando, Wisam Farjow

Properties of Algorithmic Information Distance

The domain-independent universal Normalized Information Distance based on Kolmogorov complexity has been (in approximate form) successfully applied to a variety of difficult clustering problems. In this paper we investigate theoretical…

Information Theory · Computer Science 2025-07-30 Marcus Hutter

Algorithms for Estimating Information Distance with Application to Bioinformatics and Linguistics

After reviewing unnormalized and normalized information distances based on incomputable notions of Kolmogorov complexity, we discuss how Kolmogorov complexity can be approximated by data compression algorithms. We argue that optimal…

Computational Complexity · Computer Science 2007-05-23 Alexei Kaltchenko

Information Distance

While Kolmogorov complexity is the accepted absolute measure of information content in an individual finite object, a similarly absolute notion is needed for the information distance between two individual objects, for example, two…

Information Theory · Computer Science 2010-06-18 Charles H. Bennett, Peter Gacs, Ming Li, Paul M. B. Vitanyi, Wojciech H. Zurek

Normalized Compression Distance of Multisets with Applications

Normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity measure between a pair of finite objects based on compression. However, it is not sufficient for all applications. We propose an NCD of…

Computer Vision and Pattern Recognition · Computer Science 2016-01-28 Andrew R. Cohen, Paul M. B. Vitanyi

The similarity metric

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of…

Computational Complexity · Computer Science 2011-11-09 Ming Li, Xin Chen, Xin Li, Bin Ma, Paul Vitanyi

Normalized Google Distance of Multisets with Applications

Normalized Google distance (NGD) is a relative semantic distance based on the World Wide Web (or any other large electronic database, for instance Wikipedia) and a search engine that returns aggregate page counts. The earlier NGD between…

Information Retrieval · Computer Science 2013-08-15 Andrew R. Cohen, P. M. B. Vitanyi

Normalized web distance (NWD) is a similarity or normalized semantic distance based on the World Wide Web or another large electronic database, for instance Wikipedia, and a search engine that returns reliable aggregate page counts. For…

Information Retrieval · Computer Science 2020-07-24 Andrew R. Cohen, Paul M. B. Vitanyi

Clustering by compression

We present a new method for clustering based on compression. The method doesn't use subject-specific features or background knowledge, and works as follows: First, we determine a universal similarity distance, the normalized compression…

Computer Vision and Pattern Recognition · Computer Science 2007-05-23 Rudi Cilibrasi, Paul Vitanyi

Compression-based Similarity

First we consider pair-wise distances for literal objects consisting of finite binary files. These files are taken to contain all of their meaning, like genomes or books. The distances are based on compression of the objects concerned,…

Information Theory · Computer Science 2011-10-21 Paul M. B. Vitanyi

Evaluating the Impact of Information Distortion on Normalized Compression Distance

In this paper we apply different techniques of information distortion on a set of classical books written in English. We study the impact that these distortions have upon the Kolmogorov complexity and the clustering by compression technique…

Information Theory · Computer Science 2008-05-09 Ana Granados, Manuel Cebrian, David Camacho, Francisco de B. Rodriguez

On Normalized Compression Distance and Large Malware

Normalized Compression Distance (NCD) is a popular tool that uses compression algorithms to cluster and classify data in a wide range of applications. Existing discussions of NCD's theoretical merit rely on certain theoretical properties of…

Cryptography and Security · Computer Science 2015-09-03 Rebecca Schuller Borbely

We survey the emerging area of compression-based, parameter-free, similarity distance measures useful in data-mining, pattern recognition, learning and automatic semantics extraction. Given a family of distances on a set of objects, a…

Computer Vision and Pattern Recognition · Computer Science 2007-05-23 Rudi Cilibrasi, Paul Vitanyi

Tiny, Hardware-Independent, Compression-based Classification

The recent developments in machine learning have highlighted a conflict between online platforms and their users in terms of privacy. The importance of user privacy and the struggle for power over user data has been intensified as…

Machine Learning · Computer Science 2026-03-09 Charles Meyers, Aaron MacSween, Erik Elmroth, Tommy Löfstedt

Normalized information-based divergences

This paper is devoted to the mathematical study of some divergences based on the mutual information well-suited to categorical random vectors. These divergences are generalizations of the "entropy distance" and "information distance". Their…

Statistics Theory · Mathematics 2016-08-16 Jean-François Coeurjolly, Rémy Drouilhet, Jean-François Robineau

A Consolidated Approach to Convolutional Neural Networks and the Kolmogorov Complexity

The ability to precisely quantify similarity between various entities has been a fundamental complication in various problem spaces, specifically in the classification of cellular images. Contemporary similarity measures applied in the…

Computer Vision and Pattern Recognition · Computer Science 2018-12-04 D Yoan L. Mekontchou Yomba