Related papers: An Asynchronous Distributed-Memory Parallel Algori…

High-Performance Sorting-Based k-mer Counting in Distributed Memory with Flexible Hybrid Parallelism

In generating large quantities of DNA data, high-throughput sequencing technologies require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of fixed-length k DNA…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-07-11 Yifan Li, Giulia Guidi

KMC 2: Fast and resource-frugal $k$-mer counting

Motivation: Building the histogram of occurrences of every $k$-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of $k$-mer counting. Its applications include developing de…

Data Structures and Algorithms · Computer Science 2017-03-03 Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, Agnieszka Debudaj-Grabysz

MSPKmerCounter: A Fast and Memory Efficient Approach for K-mer Counting

A major challenge in next-generation genome sequencing (NGS) is to assemble massive overlapping short reads that are randomly sampled from DNA fragments. To complete assembling, one needs to finish a fundamental task in many leading…

Genomics · Quantitative Biology 2015-05-26 Yang Li, Xifeng Yan

GaKCo: a Fast GApped k-mer string Kernel using COunting

String Kernel (SK) techniques, especially those using gapped $k$-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we…

Machine Learning · Computer Science 2017-09-19 Ritambhara Singh, Arshdeep Sekhon, Kamran Kowsari, Jack Lanchantin, Beilun Wang, Yanjun Qi

Fast Iteration of Spaced k-mers

Background: Short sequence substrings of a fixed length k, called k-mers, are a ubiquitous computational primitive in bioinformatics, used across sequence indexing, read mapping, genome assembly, metagenomic classification, and comparative…

Genomics · Quantitative Biology 2026-05-15 Lucas Czech

On Simplifying Large-Scale Spatial Vectors: Fast, Memory-Efficient, and Cost-Predictable k-means

The k-means algorithm can simplify large-scale spatial vectors, such as 2D geo-locations and 3D point clouds, to support fast analytics and learning. However, when processing large-scale datasets, existing k-means algorithms have been…

Machine Learning · Computer Science 2024-12-04 Yushuai Ji, Zepeng Liu, Sheng Wang, Yuan Sun, Zhiyong Peng

Communication-Avoiding Linear Algebraic Kernel K-Means on GPUs

Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel…

Distributed, Parallel, and Cluster Computing · Computer Science 2026-01-29 Julian Bellavita, Matthew Rubino, Nakul Iyer, Andrew Chang, Aditya Devarakonda, Flavio Vella, Giulia Guidi

Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly

De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is…

Distributed, Parallel, and Cluster Computing · Computer Science 2022-07-12 Giulia Guidi, Gabriel Raulet, Daniel Rokhsar, Leonid Oliker, Katherine Yelick, Aydin Buluc

Parallel $k$-Core Decomposition with Batched Updates and Asynchronous Reads

Maintaining a dynamic $k$-core decomposition is an important problem that identifies dense subgraphs in dynamically changing graphs. Recent work by Liu et al. [SPAA 2022] presents a parallel batch-dynamic algorithm for maintaining an…

Distributed, Parallel, and Cluster Computing · Computer Science 2024-01-17 Quanquan C. Liu, Julian Shun, Igor Zablotchi

Gerbil: A Fast and Memory-Efficient $k$-mer Counter with GPU-Support

A basic task in bioinformatics is the counting of $k$-mers in genome strings. The $k$-mer counting problem is to build a histogram of all substrings of length $k$ in a given genome sequence. We present the open source $k$-mer counting…

Data Structures and Algorithms · Computer Science 2016-07-25 Marius Erbert, Steffen Rechner, Matthias Müller-Hannemann

Distributed Kernel K-Means for Large Scale Clustering

Clustering samples according to an effective metric and/or vector space representation is a challenging unsupervised learning task with a wide spectrum of applications. Among several clustering algorithms, k-means and its kernelized version…

Distributed, Parallel, and Cluster Computing · Computer Science 2017-10-10 Marco Jacopo Ferrarotti, Sergio Decherchi, Walter Rocchia

Analyzing Big Datasets of Genomic Sequences: Fast and Scalable Collection of k-mer Statistics

Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of…

Distributed, Parallel, and Cluster Computing · Computer Science 2018-07-05 Umberto Ferraro Petrillo, Mara Sorella, Giuseppe Cattaneo, Raffaele Giancarlo, Simona Rombo

Turtle: Identifying frequent k-mers with cache-efficient algorithms

Counting the frequencies of k-mers in read libraries is often a first step in the analysis of high-throughput sequencing experiments. Infrequent k-mers are assumed to be a result of sequencing errors. The frequent k-mers constitute a…

Genomics · Quantitative Biology 2013-05-09 Rajat Shuvro Roy, Debashish Bhattacharya, Alexander Schliep

Redesigning pattern mining algorithms for supercomputers

Upcoming many core processors are expected to employ a distributed memory architecture similar to currently available supercomputers, but parallel pattern mining algorithms amenable to the architecture are not comprehensively studied. We…

Distributed, Parallel, and Cluster Computing · Computer Science 2015-10-28 Kazuki Yoshizoe, Aika Terada, Koji Tsuda

Kmerlight: fast and accurate k-mer abundance estimation

k-mers (nucleotide strings of length k) form the basis of several algorithms in computational genomics. In particular, k-mer abundance information in sequence data is useful in read error correction, parameter estimation for genome…

Data Structures and Algorithms · Computer Science 2016-09-20 Naveen Sivadasan, Rajgopal Srinivasan, Kshama Goyal

KmerCo: A lightweight K-mer counting technique with a tiny memory footprint

K-mer counting is a requisite process for DNA assembly because it speeds up its overall process. The frequency of K-mers is used for estimating the parameters of DNA assembly, error correction, etc. The process also provides a list of…

Databases · Computer Science 2023-05-15 Sabuzima Nayak, Ripon Patgiri

Achievable Information Rates and Concatenated Codes for the DNA Nanopore Sequencing Channel

The errors occurring in DNA-based storage are correlated in nature, which is a direct consequence of the synthesis and sequencing processes. In this paper, we consider the memory-$k$ nanopore channel model recently introduced by Hamoum et…

Information Theory · Computer Science 2023-03-27 Issam Maarouf, Eirik Rosnes, Alexandre Graell i Amat

Distributed-Memory Parallel Algorithms for Fixed-Radius Near Neighbor Graph Construction

Computing fixed-radius near-neighbor graphs is an important first step for many data analysis algorithms. Near-neighbor graphs connect points that are close under some metric, endowing point clouds with a combinatorial structure. As…

Distributed, Parallel, and Cluster Computing · Computer Science 2025-10-17 Gabriel Raulet, Dmitriy Morozov, Aydin Buluc, Katherine Yelick

Dynamic Parallel and Distributed Graph Cuts

Graph-cuts are widely used in computer vision. In order to speed up the optimization process and improve the scalability for large graphs, Strandmark and Kahl introduced a splitting method to split a graph into multiple subgraphs for…

Data Structures and Algorithms · Computer Science 2016-11-03 Miao Yu, Shuhan Shen, Zhanyi Hu

Fast k-means based on KNN Graph

In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost could be prohibitively high as the data size and the cluster number are large. It is well…

Machine Learning · Computer Science 2017-05-05 Cheng-Hao Deng, Wan-Lei Zhao