Related papers: Efficient seeding techniques for protein similarit…
We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We…
We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem -- a set of target alignments, an associated…
We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem - a set of target alignments, an associated…
We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt…
Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for…
While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein.…
The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD…
This paper describes a method to efficiently retrieve protein database sequences similar to a query sequence, while allowing for significant numbers of mutations. We call this method SEQR for SEQuence Retrieval. This approach increases the…
Randomised algorithms often employ methods that can fail and that are retried with independent randomness until they succeed. Randomised data structures therefore often store indices of successful attempts, called seeds. If $n$ such seeds…
Protein similarity searches are a routine job for molecular biologists where a query sequence of amino acids needs to be compared and ranked against an ever-growing database of proteins. All available algorithms in this field can be grouped…
We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search has importance in both…
Screening or assessing studies is critical to the quality and outcomes of a systematic review. Typically, a Boolean query retrieves the set of studies to screen. As the set of studies retrieved is unordered, screening all retrieved studies…
Several algorithms for similarity search employ seeding techniques to quickly discard very dissimilar regions. In this paper, we study theoretical properties of lossless seeds, i.e., spaced seeds having full sensitivity. We prove that…
We study the pattern matching automaton introduced in (A unifying framework for seed sensitivity and its application to subset seeds) for the purpose of seed-based similarity search. We show that our definition provides a compact automaton,…
The exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a…
Suffix trees have recently become very successful data structures in handling large data sequences such as DNA or Protein sequences. Consequently parallel architectures have become ubiquitous. We present a novel alphabet-dependent parallel…
The Basic Local Alignment Search Tool (BLAST) is currently the most popular method for searching databases of biological sequences. BLAST compares sequences via similarity defined by a weighted edit distance, which results in it being…
Recent advances in weakly supervised text classification mostly focus on designing sophisticated methods to turn high-level human heuristics into quality pseudo-labels. In this paper, we revisit the seed matching-based method, which is…
Set similarity search is a problem of central interest to a wide variety of applications such as data cleaning and web search. Past approaches on set similarity search utilize either heavy indexing structures, incurring large search costs…
As the structural databases continue to expand, efficient methods are required to search similar structures of the query structure from the database. There are many previous works about comparing protein 3D structures and scanning the…