Related papers: Distance approximation using Isolation Forests
The isolation forest algorithm for outlier detection exploits a simple yet effective observation: if taking some multivariate data and making uniformly random cuts across the feature space recursively, it will take fewer such random cuts…
Isolation forest or "iForest" is an intuitive and widely used algorithm for anomaly detection that follows a simple yet effective idea: in a given data distribution, if a threshold (split point) is selected uniformly at random within the…
With predictive models becoming prevalent, companies are expanding the types of data they gather. As a result, the collected datasets consist not only of simple numerical features but also more complex objects such as time series, images,…
We describe the use of an unsupervised Random Forest for similarity learning and improved unsupervised anomaly detection. By training a Random Forest to discriminate between real data and synthetic data sampled from a uniform distribution…
The wealth of data being gathered about humans and their surroundings drives new machine learning applications in various fields. Consequently, more and more often, classifiers are trained using not only numerical data but also complex data…
Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector, primarily owing to its remarkable runtime efficiency and superior performance in large-scale tasks. Despite its widespread adoption, a theoretical…
The concepts of similarity and distance are crucial in data mining. We consider the problem of defining the distance between two data sets by comparing summary statistics computed from the data sets. The initial definition of our distance…
In this article, we propose tree edit distance with variables, which is an extension of the tree edit distance to handle trees with variables and has a potential application to measuring the similarity between mathematical formulas,…
Metric learning makes it plausible to learn distances for complex distributions of data from labeled data. However, to date, most metric learning methods are based on a single Mahalanobis metric, which cannot handle heterogeneous data well.…
A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new ``normalized information distance'', based on the noncomputable notion of…
Similarity search is an important problem in information retrieval. This similarity is based on a distance. Symbolic representation of time series has attracted many researchers recently, since it reduces the dimensionality of these high…
Isolation forest (iForest) has been emerging as arguably the most popular anomaly detector in recent years due to its general effectiveness across different benchmarks and strong scalability. Nevertheless, its linear axis-parallel isolation…
In this work we study the interleaving distance between merge trees from a combinatorial point of view. We use a particular type of matching between trees to obtain a novel formulation of the distance. With such formulation, we tackle the…
We make two contributions to the Isolation Forest method for anomaly and outlier detection. The first contribution is an information-theoretically motivated generalisation of the score function that is used to aggregate the scores across…
Compared to theoretical frameworks that assume equal sensitivity to deviations in all features of data, the theory of anomaly detection allowing for variable sensitivity across features is less developed. To the best of our knowledge, this…
We study the problem of how well a tree metric is able to preserve the sum of pairwise distances of an arbitrary metric. This problem is closely related to low-stretch metric embeddings and is interesting by its own flavor from the line of…
Clustering is a fundamental approach to understanding data patterns, wherein the intuitive Euclidean distance space is commonly adopted. However, this is not the case for implicit cluster distributions reflected by qualitative attribute…
Assume we are given a set of items from a general metric space, but we neither have access to the representation of the data nor to the distances between data points. Instead, suppose that we can actively choose a triplet of items (A,B,C)…
A common approach to implementing similarity search applications is the usage of distance functions, where small distances indicate high similarity. In the case of metric distance functions, metric index structures can be used to accelerate…
For many optimization problems it is possible to define a distance metric between problem variables that correlates with the likelihood and strength of interactions between the variables. For example, one may define a metric so that the…