Related papers: Counting common substrings effectively

Comparing Two Counting Methods for Estimating the Probabilities of Strings

There are two methods for counting the number of occurrences of a string in another large string. One is to count the number of places where the string is found. The other is to determine how many pieces of string can be extracted without…

Data Structures and Algorithms · Computer Science 2022-11-09 Ayaka Takamoto, Mitsuo Yoshida, Kyoji Umemura

A Novel String Distance Function based on Most Frequent K Characters

This study aims to present a novel similarity metric that increases the speed of comparison operations. The new metric is also suitable for distance-based operations among strings. Most of the simple calculation methods, such as string length…

Data Structures and Algorithms · Computer Science 2014-01-28 Sadi Evren Seker, Oguz Altun, Uğur Ayan, Cihan Mert

Computing the Number of Longest Common Subsequences

This note provides very simple, efficient algorithms for computing the number of distinct longest common subsequences of two input strings and for computing the number of LCS embeddings.

Data Structures and Algorithms · Computer Science 2007-05-23 Ronald I. Greenberg

Efficient Approximation Algorithms for String Kernel Based Sequence Classification

Sequence classification algorithms, such as SVM, require the definition of a distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between $k$-mers ($k$-length subsequences) in the…

Data Structures and Algorithms · Computer Science 2017-12-13 Muhammad Farhan, Juvaria Tariq, Arif Zaman, Mudassir Shabbir, Imdad Ullah Khan

Detecting Machine-Translated Paragraphs by Matching Similar Words

Machine-translated text plays an important role in modern life by smoothing communication between communities that use different languages. However, unnatural translation may lead to misunderstanding; a detector is thus needed to avoid…

Computation and Language · Computer Science 2019-04-25 Hoang-Quoc Nguyen-Son, Tran Phuong Thao, Seira Hidano, Shinsaku Kiyomoto

Improved Algorithms for Approximate String Matching (Extended Abstract)

The problem of approximate string matching is important in many different areas such as computational biology, text processing and pattern recognition. A great effort has been made to design efficient algorithms addressing several variants…

Data Structures and Algorithms · Computer Science 2008-07-29 Dimitris Papamichail, Georgios Papamichail

Scalable Methods for Calculating Term Co-Occurrence Frequencies

Search techniques make use of elementary information such as term frequencies and document lengths in computation of similarity weighting. They can also exploit richer statistics, in particular the number of documents in which any two terms…

Information Retrieval · Computer Science 2020-07-20 Bodo Billerbeck, Justin Zobel, Nicholas Lester, Nick Craswell

$LCSk$++: Practical similarity metric for long strings

In this paper we present $LCSk$++: a new metric for measuring the similarity of long strings, and provide an algorithm for its efficient computation. With the ever-increasing size of strings occurring in practice, e.g. large genomes of plants…

Data Structures and Algorithms · Computer Science 2019-08-27 Filip Pavetić, Goran Žužić, Mile Šikić

Mining Statistically Significant Substrings Based on the Chi-Square Measure

Given the vast reservoirs of data stored worldwide, efficient mining of data from a large information store has emerged as a great challenge. Many databases, such as those of intrusion detection systems, web-click records, player statistics,…

Databases · Computer Science 2010-03-09 Sourav Dutta, Arnab Bhattacharya

Faster Algorithm of String Comparison

In many applications, it is necessary to determine string similarity. The edit distance [WF74] approach is a classic method for determining field similarity. A well-known dynamic programming algorithm [GUS97] is used to calculate edit distance…

Data Structures and Algorithms · Computer Science 2007-05-23 Qi Xiao Yang, Sung Sam Yuan, Lu Chun, Li Zhao, Sun Peng

Combining a Context Aware Neural Network with a Denoising Autoencoder for Measuring String Similarities

Measuring similarities between strings is central to many established and fast-growing research areas including information retrieval, biology, and natural language processing. The traditional approach for string similarity measurements is…

Information Retrieval · Computer Science 2018-08-20 Mehdi Ben Lazreg, Morten Goodwin

Detecting $k$-(Sub-)Cadences and Equidistant Subsequence Occurrences

The equidistant subsequence pattern matching problem is considered. Given a pattern string $P$ and a text string $T$, we say that $P$ is an equidistant subsequence of $T$ if $P$ is a subsequence of the text such that consecutive…

Data Structures and Algorithms · Computer Science 2020-02-18 Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, Hideo Bannai, Masayuki Takeda, Ayumi Shinohara

Many Flavors of Edit Distance

Several measures exist for string similarity, including notable ones like the edit distance and the indel distance. The former measures the count of insertions, deletions, and substitutions required to transform one string into another,…

Data Structures and Algorithms · Computer Science 2024-10-15 Sudatta Bhattacharya, Sanjana Dey, Elazar Goldenberg, Michal Koucký
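
As a rough illustration of the two measures named in the abstract above, the following Python sketch computes both with the textbook dynamic program. It is not taken from the paper: the function name `edit_distance` and the flag `allow_substitution` are hypothetical, and the indel distance is assumed to follow its standard definition (insertions and deletions only), since the abstract is cut off before defining it.

```python
def edit_distance(a: str, b: str, allow_substitution: bool = True) -> int:
    """Minimum number of single-character operations that turn a into b.

    allow_substitution=True  -> insertions, deletions, substitutions
                                (edit distance, as described in the abstract).
    allow_substitution=False -> insertions and deletions only
                                (indel distance; standard definition, assumed here).
    """
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))  # empty prefix of a: j insertions
    for i, ca in enumerate(a, start=1):
        curr = [i]  # a[:i] versus empty string: i deletions
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                best = prev[j - 1]  # characters match, no operation needed
            else:
                # delete ca (prev[j]) or insert cb (curr[j - 1])
                best = min(prev[j], curr[j - 1]) + 1
                if allow_substitution:
                    best = min(best, prev[j - 1] + 1)  # substitute ca by cb
            curr.append(best)
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(edit_distance("kitten", "sitting", allow_substitution=False))
```

The sketch runs in time proportional to the product of the two string lengths and keeps only two rows of the table in memory; it is meant to make the definitions concrete, not to reflect the algorithms studied in the paper.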

The problem of measuring the similarity of graphs and their nodes is important in a range of practical problems. There are a number of proposed measures, some of them being based on iterative calculation of similarity between two graphs and the…

Artificial Intelligence · Computer Science 2010-09-28 Mladen Nikolic

Modular Subset Sum, Dynamic Strings, and Zero-Sum Sets

The modular subset sum problem consists of deciding, given a modulus $m$, a multiset $S$ of $n$ integers in $0..m-1$, and a target integer $t$, whether there exists a subset of $S$ with elements summing to $t \bmod m$, and to report such a…

Data Structures and Algorithms · Computer Science 2023-10-27 Jean Cardinal, John Iacono

Maximizing Diversity in (near-)Median String Selection

Given a set of strings over a specified alphabet, identifying a median or consensus string that minimizes the total distance to all input strings is a fundamental data aggregation problem. When the Hamming distance is considered as the…

Data Structures and Algorithms · Computer Science 2026-02-11 Diptarka Chakraborty, Rudrayan Kundu, Nidhi Purohit, Aravinda Kanchana Ruwanpathirana

Speeding Up String Matching by Weak Factor Recognition

String matching is the problem of finding all the substrings of a text which match a given pattern. It is one of the most investigated problems in computer science, mainly due to its very diverse applications in several fields. Recently,…

Data Structures and Algorithms · Computer Science 2017-07-04 Domenico Cantone, Simone Faro, Arianna Pavone

Faster Binary Mean Computation Under Dynamic Time Warping

Many consensus string problems are based on Hamming distance. We replace Hamming distance by the more flexible (e.g., easily coping with different input string lengths) dynamic time warping distance, best known from applications in time…

Discrete Mathematics · Computer Science 2020-02-05 Nathan Schaar, Vincent Froese, Rolf Niedermeier

Systolic Array Technique for Determining Common Approximate Substrings

A technique using a systolic array structure is proposed for solving the common approximate substring (CAS) problem. This approach extends the technique introduced in earlier work from the computation of the edit-distance between two…

Data Structures and Algorithms · Computer Science 2010-06-08 Jacqueline E. Rice, Kenneth B. Kent

Parallel Algorithms for Densest Subgraph Discovery Using Shared Memory Model

The problem of finding dense components of a graph is a widely explored area in data analysis, with diverse applications in fields and branches of study including community mining, spam detection, computer security and bioinformatics. This…

Information Retrieval · Computer Science 2021-03-02 B. D. M. De Zoysa, Y. A. M. M. A. Ali, M. D. I. Maduranga, Indika Perera, Saliya Ekanayake, Anil Vullikanti