Related papers: Learning from String Sequences

Efficient Approximation Algorithms for String Kernel Based Sequence Classification

Sequence classification algorithms, such as SVM, require a definition of distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between $k$-mers ($k$-length subsequences) in the…

Data Structures and Algorithms · Computer Science 2017-12-13 Muhammad Farhan, Juvaria Tariq, Arif Zaman, Mudassir Shabbir, Imdad Ullah Khan

Compression ratios based on the Universal Similarity Metric still yield protein distances far from CATH distances

Kolmogorov complexity has inspired several alignment-free distance measures, based on the comparison of lengths of compressions, which have been applied successfully in many areas. One of these measures, the so-called Universal Similarity…

Quantitative Methods · Quantitative Biology 2011-11-10 Jairo Rocha, Francesc Rosselló, Joan Segura

Faster Algorithm of String Comparison

In many applications, it is necessary to determine string similarity. The edit distance approach [WF74] is a classic method for determining field similarity. A well-known dynamic programming algorithm [GUS97] is used to calculate edit distance…

Data Structures and Algorithms · Computer Science 2007-05-23 Qi Xiao Yang, Sung Sam Yuan, Lu Chun, Li Zhao, Sun Peng

Dimensionality Invariant Similarity Measure

This paper presents a new similarity measure to be used for general tasks including supervised learning, which is represented by the K-nearest neighbor classifier (KNN). The proposed similarity measure is invariant to large differences in…

Machine Learning · Computer Science 2014-09-04 Ahmad Basheer Hassanat

A Novel String Distance Function based on Most Frequent K Characters

This study aims to publish a novel similarity metric to increase the speed of comparison operations. Also the new metric is suitable for distance-based operations among strings. Most of the simple calculation methods, such as string length…

Data Structures and Algorithms · Computer Science 2014-01-28 Sadi Evren Seker, Oguz Altun, Uğur Ayan, Cihan Mert

SPELUNKER: Item Similarity Search Using Large Language Models and Custom K-Nearest Neighbors

This paper presents a hybrid system for intuitive item similarity search that combines a Large Language Model (LLM) with a custom K-Nearest Neighbors (KNN) algorithm. Unlike black-box dense vector systems, this architecture provides…

Information Retrieval · Computer Science 2025-09-29 Ana Rodrigues, João Mata, Rui Rego

Adaptive Nearest Neighbor: A General Framework for Distance Metric Learning

$K$-NN classifier is one of the most famous classification algorithms, whose performance is crucially dependent on the distance metric. When we consider the distance metric as a parameter of $K$-NN, learning an appropriate distance metric…

Machine Learning · Computer Science 2019-11-26 Kun Song

Metric learning by Similarity Network for Deep Semi-Supervised Learning

Deep semi-supervised learning has been widely implemented in the real world due to the rapid development of deep learning. Recently, attention has shifted to approaches such as Mean-Teacher to penalize the inconsistency between two…

Machine Learning · Statistics 2020-04-30 Sanyou Wu, Xingdong Feng, Fan Zhou

Combining a Context Aware Neural Network with a Denoising Autoencoder for Measuring String Similarities

Measuring similarities between strings is central for many established and fast growing research areas including information retrieval, biology, and natural language processing. The traditional approach for string similarity measurements is…

Information Retrieval · Computer Science 2018-08-20 Mehdi Ben Lazreg, Morten Goodwin

dna2vec: Consistent vector representations of variable-length k-mers

One ubiquitous representation of a long DNA sequence divides it into shorter k-mer components. Unfortunately, the straightforward encoding of a k-mer as a one-hot vector is vulnerable to the curse of dimensionality. Worse yet,…

Quantitative Methods · Quantitative Biology 2017-01-24 Patrick Ng

Generalization through Memorization: Nearest Neighbor Language Models

We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding…

Computation and Language · Computer Science 2020-02-18 Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis

Study and Observation of the Variation of Accuracies of KNN, SVM, LMNN, ENN Algorithms on Eleven Different Datasets from UCI Machine Learning Repository

Machine learning qualifies computers to assimilate with data, without being solely programmed [1, 2]. Machine learning can be classified as supervised and unsupervised learning. In supervised learning, computers learn an objective that…

Machine Learning · Computer Science 2019-02-06 Mohammad Mahmudur Rahman Khan, Rezoana Bente Arif, Md. Abu Bakr Siddique, Mahjabin Rahman Oishe

Improved Algorithms for Approximate String Matching (Extended Abstract)

The problem of approximate string matching is important in many different areas such as computational biology, text processing and pattern recognition. A great effort has been made to design efficient algorithms addressing several variants…

Data Structures and Algorithms · Computer Science 2008-07-29 Dimitris Papamichail, Georgios Papamichail

Assessing the Unitary RNN as an End-to-End Compositional Model of Syntax

We show that both an LSTM and a unitary-evolution recurrent neural network (URN) can achieve encouraging accuracy on two types of syntactic patterns: context-free long distance agreement, and mildly context-sensitive cross serial…

Computation and Language · Computer Science 2022-08-12 Jean-Philippe Bernardy, Shalom Lappin

Thresholding of Semantic Similarity Networks using a Spectral Graph Based Technique

Semantic similarity measures (SSMs) refer to a set of algorithms used to quantify the similarity of two or more terms belonging to the same ontology. Ontology terms may be associated to concepts, for instance in computational biology gene…

Molecular Networks · Quantitative Biology 2013-05-22 Pietro Hiram Guzzi, Simone Truglia, Pierangelo Veltri, Mario Cannataro

Deep Distance Measurement Method for Unsupervised Multivariate Time Series Similarity Retrieval

We propose the Deep Distance Measurement Method (DDMM) to improve retrieval accuracy in unsupervised multivariate time series similarity retrieval. DDMM enables learning of minute differences within states in the entire time series and…

Machine Learning · Computer Science 2026-03-16 Susumu Naito, Kouta Nakata, Yasunori Taguchi

More and more neural network approaches have achieved considerable improvement upon submodules of speaker diarization system, including speaker change detection and segment-wise speaker embedding extraction. Still, in the clustering stage,…

Audio and Speech Processing · Electrical Eng. & Systems 2019-12-02 Qingjian Lin, Ruiqing Yin, Ming Li, Hervé Bredin, Claude Barras

Proposal and study of statistical features for string similarity computation and classification

Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The…

Machine Learning · Computer Science 2026-05-15 E. O. Rodrigues, D. Casanova, M. Teixeira, V. Pegorini, F. Favarim, E. Clua, A. Conci, Panos Liatsis

k-Nearest Neighbour Classification of Datasets with a Family of Distances

The $k$-nearest neighbour ($k$-NN) classifier is one of the oldest and most important supervised learning algorithms for classifying datasets. Traditionally the Euclidean norm is used as the distance for the $k$-NN classifier. In this…

Machine Learning · Statistics 2015-12-02 Stan Hatko

The similarity metric

A new class of distances appropriate for measuring similarity relations between sequences, say one type of similarity per distance, is studied. We propose a new "normalized information distance", based on the noncomputable notion of…

Computational Complexity · Computer Science 2011-11-09 Ming Li, Xin Chen, Xin Li, Bin Ma, Paul Vitanyi