Related papers: Clustering by compression
We present a new similarity measure based on information theoretic measures which is superior than Normalized Compression Distance for clustering problems and inherits the useful properties of conditional Kolmogorov complexity. We show that…
Normalized Compression Distance (NCD) is a popular tool that uses compression algorithms to cluster and classify data in a wide range of applications. Existing discussions of NCD's theoretical merit rely on certain theoretical properties of…
Normalized compression distance (NCD) is a parameter-free, feature-free, alignment-free, similarity measure between a pair of finite objects based on compression. However, it is not sufficient for all applications. We propose an NCD of…
Compression-based dissimilarities (CD) offer a flexible and domain-agnostic means of measuring similarity by identifying implicit information through redundancies between data objects. However, as similarity features are derived from the…
Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…
It is generally well understood that predictive classification and compression are intrinsically related concepts in information theory. Indeed, many deep learning methods are explained as learning a kind of compression, and that better…
The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to…
P300 is an Event-Related Potential widely used in Brain-Computer Interfaces, but its detection is challenging due to inter-subject and temporal variability. This work introduces a clustering methodology based on Normalized Compression…
Similarity-based clustering methods separate data into clusters according to the pairwise similarity between the data, and the pairwise similarity is crucial for their performance. In this paper, we propose {\em Clustering by Discriminative…
Clustering is a popular machine learning technique for data mining that can process and analyze datasets to automatically reveal sample distribution patterns. Since the ubiquitous categorical data naturally lack a well-defined metric space…
In the field of biological research, it is essential to comprehend the characteristics and functions of molecular sequences. The classification of molecular sequences has seen widespread use of neural network-based techniques. Despite their…
Crowdsourced, or human computation based clustering algorithms usually rely on relative distance comparisons, as these are easier to elicit from human workers than absolute distance information. A relative distance comparison is a statement…
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new ``normalized information distance'', based on the noncomputable notion of…
We present a technique for clustering categorical data by generating many dissimilarity matrices and averaging over them. We begin by demonstrating our technique on low dimensional categorical data and comparing it to several other…
We present a fully automatic method for music classification, based only on compression of strings that represent the music pieces. The method uses no background knowledge about music whatsoever: it is completely general and can, without…
Clustering is a fundamental analysis tool aiming at classifying data points into groups based on their similarity or distance. It has found successful applications in all natural and social sciences, including biology, physics, economics,…
We present a novel hierarchical graph clustering algorithm inspired by modularity-based clustering techniques. The algorithm is agglomerative and based on a simple distance between clusters induced by the probability of sampling node pairs.…
Clustering is a data analysis method for extracting knowledge by discovering groups of data called clusters. Among these methods, state-of-the-art density-based clustering methods have proven to be effective for arbitrary-shaped clusters.…
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its…
Normalized information distance (NID) uses the theoretical notion of Kolmogorov complexity, which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.…