Related papers: Distributed Many-to-Many Protein Sequence Alignmen…

A distributed-memory package for dense Hierarchically Semi-Separable matrix computations using randomization

We present a distributed-memory library for computations with dense structured matrices. A matrix is considered structured if its off-diagonal blocks can be approximated by a rank-deficient matrix with low numerical rank. Here, we use…

Mathematical Software · Computer Science 2015-06-29 François-Henry Rouet, Xiaoye S. Li, Pieter Ghysels, Artem Napov

Extreme-scale many-against-many protein similarity search

Similarity search is one of the most fundamental computations that are regularly performed on ever-increasing protein datasets. Scalability is of paramount importance for uncovering novel phenomena that occur at very large scales. We…

Distributed, Parallel, and Cluster Computing · Computer Science 2023-03-06 Oguz Selvitopi, Saliya Ekanayake, Giulia Guidi, Muaaz G. Awan, Georgios A. Pavlopoulos, Ariful Azad, Nikos Kyrpides, Leonid Oliker, Katherine Yelick, Aydın Buluç

A Space-Efficient Approach towards Distantly Homologous Protein Similarity Searches

Protein similarity searches are a routine job for molecular biologists where a query sequence of amino acids needs to be compared and ranked against an ever-growing database of proteins. All available algorithms in this field can be grouped…

Computational Engineering, Finance, and Science · Computer Science 2015-08-27 Akash Nag, Sunil Karforma

DiffPaSS -- High-performance differentiable pairing of protein sequences using soft scores

Identifying interacting partners from two sets of protein sequences has important applications in computational biology. Interacting partners share similarities across species due to their common evolutionary history, and feature…

Biomolecules · Quantitative Biology 2024-12-31 Umberto Lupo, Damiano Sgarbossa, Martina Milighetti, Anne-Florence Bitbol

Parallel and Scalable Precise Clustering for Homologous Protein Discovery

This paper presents a new, parallel implementation of clustering and demonstrates its utility in greatly speeding up the process of identifying homologous proteins. Clustering is a technique to reduce the number of comparisons needed to find…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-08-29 Stuart Byma, Akash Dhasade, Adrian Altenhoff, Christophe Dessimoz, James R. Larus

Distributed-memory Algorithms for Sparse Matrix Permutation, Extraction, and Assignment

We present scalable distributed-memory algorithms for sparse matrix permutation, extraction, and assignment. Our methods follow an Identify-Exchange-Build (IEB) strategy where each process identifies the local nonzeros to be sent, exchanges…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-09-26 Elaheh Hassani, Md Taufique Hussain, Ariful Azad

Indexing Schemes for Similarity Search In Datasets of Short Protein Fragments

We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search has importance in both…

Data Structures and Algorithms · Computer Science 2007-09-04 Aleksandar Stojmirovic, Vladimir Pestov

A PLMs based protein retrieval framework

Protein retrieval, which targets the deconstruction of the relationship between sequences, structures and functions, empowers the advancing of biology. Basic Local Alignment Search Tool (BLAST), a sequence-similarity-based algorithm, has…

Information Retrieval · Computer Science 2025-01-06 Yuxuan Wu, Xiao Yi, Yang Tan, Huiqun Yu, Guisheng Fan, Gaowei Zheng

A Learned Index for Exact Similarity Search in Metric Spaces

Indexing is an effective way to support efficient query processing in large databases. Recently the concept of learned index, which replaces or complements traditional index structures with machine learning models, has been actively…

Databases · Computer Science 2022-08-01 Yao Tian, Tingyun Yan, Xi Zhao, Kai Huang, Xiaofang Zhou

PASTA: A Parallel Sparse Tensor Algorithm Benchmark Suite

Tensor methods have gained increasing attention from various applications, including machine learning, quantum chemistry, healthcare analytics, social network analysis, data mining, and signal processing, to name a few. Sparse tensors and…

Distributed, Parallel, and Cluster Computing · Computer Science 2019-02-12 Jiajia Li, Yuchen Ma, Xiaolong Wu, Ang Li, Kevin Barker

Distributed-Memory Parallel Algorithms for Sparse Matrix and Sparse Tall-and-Skinny Matrix Multiplication

We consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, called TS-SpGEMM, has important applications in multi-source breadth-first search,…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-08-23 Isuru Ranawaka, Md Taufique Hussain, Charles Block, Gerasimos Gerogiannis, Josep Torrellas, Ariful Azad

The Ubiquitous Sparse Matrix-Matrix Products

Multiplication of a sparse matrix with another (dense or sparse) matrix is a fundamental operation that captures the computational patterns of many data science applications, including but not limited to graph algorithms, sparsely connected…

Numerical Analysis · Mathematics 2025-08-07 Aydın Buluç

Scalable Protein Sequence Similarity Search using Locality-Sensitive Hashing and MapReduce

Metagenomics is the study of environments through genetic sampling of their microbiota. Metagenomic studies produce large datasets that are estimated to grow at a faster rate than the available computational capacity. A key step in the…

Distributed, Parallel, and Cluster Computing · Computer Science 2013-10-04 Freddie Sunarso, Srikumar Venugopal, Federico Lauro

DIMS: Distributed Index for Similarity Search in Metric Spaces

Similarity search finds objects that are similar to a given query object based on a similarity metric. As the amount and variety of data continue to grow, similarity search in metric spaces has gained significant attention. Metric spaces…

Databases · Computer Science 2024-10-08 Yifan Zhu, Chengyang Luo, Tang Qian, Lu Chen, Yunjun Gao, Baihua Zheng

Complexes Detection in Biological Networks via Diversified Dense Subgraphs Mining

Protein-protein interaction (PPI) networks, providing a comprehensive landscape of protein interacting patterns, enable us to explore biological processes and cellular components at multiple resolutions. For a biological process, a number…

Molecular Networks · Quantitative Biology 2016-04-13 Xiuli Ma, Guangyu Zhou, Jingjing Wang, Jian Peng, Jiawei Han

SpANNS: Optimizing Approximate Nearest Neighbor Search for Sparse Vectors Using Near Memory Processing

Approximate Nearest Neighbor Search (ANNS) is a fundamental operation in vector databases, enabling efficient similarity search in high-dimensional spaces. While dense ANNS has been optimized using specialized hardware accelerators, sparse…

Databases · Computer Science 2026-01-07 Tianqi Zhang, Flavio Ponzina, Tajana Rosing

SPADE: Faster Drug Discovery by Learning from Sparse Data

Drug discovery seeks molecules (ligands) that bind strongly and selectively to a target protein. However, fewer than 5% of candidate ligands pass the bar for even the early stages of drug discovery. Furthermore, we want methods that work…

Machine Learning · Computer Science 2026-05-08 Rahul Nandakumar, Ben Fauber, Deepayan Chakrabarti

PASS: De novo assembler for short peptide sequences

The ability to characterize proteins at sequence-level resolution is vital to biological research. Currently, the leading method for protein sequencing is by liquid chromatography mass spectrometry (LC-MS) whereby proteins are reduced to…

Genomics · Quantitative Biology 2025-07-11 René L. Warren

Distributed Machine Learning with Sparse Heterogeneous Data

Motivated by distributed machine learning settings such as Federated Learning, we consider the problem of fitting a statistical model across a distributed collection of heterogeneous data sets whose similarity structure is encoded by a…

Statistics Theory · Mathematics 2021-11-30 Dominic Richards, Sahand N. Negahban, Patrick Rebeschini

Sparse integrative clustering of multiple omics data sets

High resolution microarrays and second-generation sequencing platforms are powerful tools to investigate genome-wide alterations in DNA copy number, methylation and gene expression associated with a disease. An integrated genomic profiling…

Applications · Statistics 2013-04-22 Ronglai Shen, Sijian Wang, Qianxing Mo