Related papers: Efficient seeding techniques for protein similarit…

On subset seeds for protein alignment

We apply the concept of subset seeds proposed in [1] to similarity search in protein sequences. The main question studied is the design of efficient seed alphabets to construct seeds with optimal sensitivity/selectivity trade-offs. We…

Quantitative Methods · Quantitative Biology 2011-01-18 Mikhail A. Roytberg, Anna Gambin, Laurent Noé, Slawomir Lasota, Eugenia Furletova, Ewa Szczurek, Gregory Kucherov

A unifying framework for seed sensitivity and its application to subset seeds

We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem -- a set of target alignments, an associated…

Data Structures and Algorithms · Computer Science 2010-01-19 Gregory Kucherov, Laurent Noé, Mikhail Roytberg

A unifying framework for seed sensitivity and its application to subset seeds (Extended abstract)

We propose a general approach to compute the seed sensitivity, that can be applied to different definitions of seeds. It treats separately three components of the seed sensitivity problem - a set of target alignments, an associated…

Other Computer Science · Computer Science 2011-01-18 Gregory Kucherov, Laurent Noé, Mikhail Roytberg

Multiseed Lossless Filtration

We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt…

Quantitative Methods · Quantitative Biology 2011-01-18 Gregory Kucherov, Laurent Noé, Mikhail A. Roytberg

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for…

Genomics · Quantitative Biology 2023-05-24 Can Firtina, Jisung Park, Mohammed Alser, Jeremie S. Kim, Damla Senol Cali, Taha Shahroodi, Nika Mansouri Ghiasi, Gagandeep Singh, Konstantinos Kanellopoulos, Can Alkan, Onur Mutlu

Tree Search-Based Evolutionary Bandits for Protein Sequence Optimization

While modern biotechnologies allow synthesizing new proteins and function measurements at scale, efficiently exploring a protein sequence space and engineering it remains a daunting task due to the vast sequence space of any given protein.…

Biomolecules · Quantitative Biology 2024-01-15 Jiahao Qiu, Hui Yuan, Jinghong Zhang, Wentao Chen, Huazheng Wang, Mengdi Wang

Seed design framework for mapping SOLiD reads

The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD…

Quantitative Methods · Quantitative Biology 2011-01-18 Laurent Noé, Marta L. Gîrdea, Gregory Kucherov

Searching by index for similar sequences: the SEQR algorithm

This paper describes a method to efficiently retrieve protein database sequences similar to a query sequence, while allowing for significant numbers of mutations. We call this method SEQR for SEQuence Retrieval. This approach increases the…

Genomics · Quantitative Biology 2018-11-05 David I. Hurwitz, Lianyi Han, Lewis Y. Geer

Combined Search and Encoding for Seeds, with an Application to Minimal Perfect Hashing

Randomised algorithms often employ methods that can fail and that are retried with independent randomness until they succeed. Randomised data structures therefore often store indices of successful attempts, called seeds. If $n$ such seeds…

Data Structures and Algorithms · Computer Science 2025-07-03 Hans-Peter Lehmann, Peter Sanders, Stefan Walzer, Jonatan Ziegler

A Space-Efficient Approach towards Distantly Homologous Protein Similarity Searches

Protein similarity searches are a routine job for molecular biologists where a query sequence of amino acids needs to be compared and ranked against an ever-growing database of proteins. All available algorithms in this field can be grouped…

Computational Engineering, Finance, and Science · Computer Science 2015-08-27 Akash Nag, Sunil Karforma

Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search has importance in both…

Data Structures and Algorithms · Computer Science 2007-09-04 Aleksandar Stojmirovic, Vladimir Pestov

Seed-driven Document Ranking for Systematic Reviews: A Reproducibility Study

Screening or assessing studies is critical to the quality and outcomes of a systematic review. Typically, a Boolean query retrieves the set of studies to screen. As the set of studies retrieved is unordered, screening all retrieved studies…

Information Retrieval · Computer Science 2021-12-09 Shuai Wang, Harrisen Scells, Ahmed Mourad, Guido Zuccon

Languages of lossless seeds

Several algorithms for similarity search employ seeding techniques to quickly discard very dissimilar regions. In this paper, we study theoretical properties of lossless seeds, i.e., spaced seeds having full sensitivity. We prove that…

Discrete Mathematics · Computer Science 2014-05-23 Karel Břinda

Subset seed automaton

We study the pattern matching automaton introduced in (A unifying framework for seed sensitivity and its application to subset seeds) for the purpose of seed-based similarity search. We show that our definition provides a compact automaton,…

Formal Languages and Automata Theory · Computer Science 2014-08-27 Gregory Kucherov, Laurent Noé, Mikhail Roytberg

Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN

The exponential growth of DNA sequencing data has outpaced traditional heuristic-based methods, which struggle to scale effectively. Efficient computational approaches are urgently needed to support large-scale similarity search, a…

Genomics · Quantitative Biology 2025-07-24 Mohammad Saleh Refahi, Gavin Hearne, Harrison Muller, Kieran Lynch, Bahrad A. Sokhansanj, James R. Brown, Gail Rosen

Alphabet-dependent Parallel Algorithm for Suffix Tree Construction for Pattern Searching

Suffix trees have recently become very successful data structures in handling large data sequences such as DNA or Protein sequences. Consequently parallel architectures have become ubiquitous. We present a novel alphabet-dependent parallel…

Data Structures and Algorithms · Computer Science 2017-04-20 Freeson Kaniwa, Venu Madhav Kuthadi, Otlhapile Dinakenyane, Heiko Schroeder

Fixed-Length Protein Embeddings using Contextual Lenses

The Basic Local Alignment Search Tool (BLAST) is currently the most popular method for searching databases of biological sequences. BLAST compares sequences via similarity defined by a weighted edit distance, which results in it being…

Biomolecules · Quantitative Biology 2020-10-29 Amir Shanehsazzadeh, David Belanger, David Dohan

Debiasing Made State-of-the-art: Revisiting the Simple Seed-based Weak Supervision for Text Classification

Recent advances in weakly supervised text classification mostly focus on designing sophisticated methods to turn high-level human heuristics into quality pseudo-labels. In this paper, we revisit the seed matching-based method, which is…

Computation and Language · Computer Science 2023-10-24 Chengyu Dong, Zihan Wang, Jingbo Shang

LES3: Learning-based Exact Set Similarity Search

Set similarity search is a problem of central interest to a wide variety of applications such as data cleaning and web search. Past approaches on set similarity search utilize either heavy indexing structures, incurring large search costs…

Databases · Computer Science 2021-07-23 Yifan Li, Xiaohui Yu, Nick Koudas

Efficient and scalable geometric hashing method for searching protein 3D structures

As the structural databases continue to expand, efficient methods are required to search similar structures of the query structure from the database. There are many previous works about comparing protein 3D structures and scanning the…

Databases · Computer Science 2011-02-16 Gook-Pil Roh, Seung-won Hwang, Byoung-Kee Yi