Related papers: Clustering by compression

Generalized Compression Dictionary Distance as Universal Similarity Measure

We present a new similarity measure based on information theoretic measures which is superior to Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that…

Machine Learning · Statistics 2014-10-22 Andrey Bogomolov, Bruno Lepri, Fabio Pianesi

On Normalized Compression Distance and Large Malware

Normalized Compression Distance (NCD) is a popular tool that uses compression algorithms to cluster and classify data in a wide range of applications. Existing discussions of NCD's theoretical merit rely on certain theoretical properties of…

Cryptography and Security · Computer Science 2015-09-03 Rebecca Schuller Borbely

Normalized Compression Distance of Multisets with Applications

Normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity measure between a pair of finite objects based on compression. However, it is not sufficient for all applications. We propose an NCD of…

Computer Vision and Pattern Recognition · Computer Science 2016-01-28 Andrew R. Cohen, Paul M. B. Vitanyi

Context Steering: A New Paradigm for Compression-based Embeddings by Synthesizing Relevant Information Features

Compression-based dissimilarities (CD) offer a flexible and domain-agnostic means of measuring similarity by identifying implicit information through redundancies between data objects. However, as similarity features are derived from the…

Machine Learning · Computer Science 2026-05-13 Guillermo Sarasa, Ana Granados, Francisco de Borja Rodríguez

Nonapproximablity of the Normalized Information Distance

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…

Computational Complexity · Computer Science 2009-10-23 Sebastiaan A. Terwijn, Leen Torenvliet, Paul M. B. Vitanyi

Neural Normalized Compression Distance and the Disconnect Between Compression and Classification

It is generally well understood that predictive classification and compression are intrinsically related concepts in information theory. Indeed, many deep learning methods are explained as learning a kind of compression, and that better…

Machine Learning · Computer Science 2024-10-22 John Hurwitz, Charles Nicholas, Edward Raff

Normalized Information Distance

The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to…

Information Retrieval · Computer Science 2008-09-16 Paul M. B. Vitanyi, Frank J. Balbach, Rudi L. Cilibrasi, Ming Li

Algorithmic Clustering based on String Compression to Extract P300 Structure in EEG Signals

P300 is an Event-Related Potential widely used in Brain-Computer Interfaces, but its detection is challenging due to inter-subject and temporal variability. This work introduces a clustering methodology based on Normalized Compression…

Machine Learning · Computer Science 2025-02-04 Guillermo Sarasa, Ana Granados, Francisco B Rodríguez

Discriminative Similarity for Data Clustering

Similarity-based clustering methods separate data into clusters according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose Clustering by Discriminative…

Machine Learning · Computer Science 2022-06-24 Yingzhen Yang, Ping Li

Categorical Data Clustering via Value Order Estimated Distance Metric Learning

Clustering is a popular machine learning technique for data mining that can process and analyze datasets to automatically reveal sample distribution patterns. Since the ubiquitous categorical data naturally lack a well-defined metric space…

Machine Learning · Computer Science 2025-09-01 Yiqun Zhang, Mingjie Zhao, Hong Jia, Yang Lu, Mengke Li, Yiu-ming Cheung

A Universal Non-Parametric Approach For Improved Molecular Sequence Analysis

In the field of biological research, it is essential to comprehend the characteristics and functions of molecular sequences. The classification of molecular sequences has seen widespread use of neural network-based techniques. Despite their…

Machine Learning · Computer Science 2024-02-14 Sarwan Ali, Tamkanat E Ali, Prakash Chourasia, Murray Patterson

Crowdsourced correlation clustering with relative distance comparisons

Crowdsourced, or human computation based clustering algorithms usually rely on relative distance comparisons, as these are easier to elicit from human workers than absolute distance information. A relative distance comparison is a statement…

Data Structures and Algorithms · Computer Science 2017-09-26 Antti Ukkonen

The similarity metric

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of…

Computational Complexity · Computer Science 2011-11-09 Ming Li, Xin Chen, Xin Li, Bin Ma, Paul Vitanyi

Clustering categorical data via ensembling dissimilarity matrices

We present a technique for clustering categorical data by generating many dissimilarity matrices and averaging over them. We begin by demonstrating our technique on low dimensional categorical data and comparing it to several other…

Machine Learning · Statistics 2017-09-20 Saeid Amiri, Bertrand Clarke, Jennifer Clarke

Algorithmic Clustering of Music

We present a fully automatic method for music classification, based only on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is completely general and can, without…

Sound · Computer Science 2016-08-31 Rudi Cilibrasi, Paul Vitanyi, Ronald de Wolf

Hierarchical Clustering Supported by Reciprocal Nearest Neighbors

Clustering is a fundamental analysis tool aiming at classifying data points into groups based on their similarity or distance. It has found successful applications in all natural and social sciences, including biology, physics, economics,…

Information Retrieval · Computer Science 2021-02-24 Wen-Bo Xie, Yan-Li Lee, Cong Wang, Duan-Bing Chen, Tao Zhou

Hierarchical Graph Clustering using Node Pair Sampling

We present a novel hierarchical graph clustering algorithm inspired by modularity-based clustering techniques. The algorithm is agglomerative and based on a simple distance between clusters induced by the probability of sampling node pairs.…

Social and Information Networks · Computer Science 2018-06-25 Thomas Bonald, Bertrand Charpentier, Alexis Galland, Alexandre Hollocou

DECWA: Density-Based Clustering using Wasserstein Distance

Clustering is a data analysis method for extracting knowledge by discovering groups of data called clusters. Among these methods, state-of-the-art density-based clustering methods have proven to be effective for arbitrary-shaped clusters.…

Machine Learning · Computer Science 2023-10-26 Nabil El Malki, Robin Cugny, Olivier Teste, Franck Ravat

Convex Clustering: An Attractive Alternative to Hierarchical Clustering

The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its…

Genomics · Quantitative Biology 2018-06-07 Gary K. Chen, Eric Chi, John Ranola, Kenneth Lange

Normalized Information Distance is Not Semicomputable

Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…

Computational Complexity · Computer Science 2010-06-17 Sebastiaan A. Terwijn, Leen Torenvliet, Paul M. B. Vitanyi