Related papers: An Asynchronous Distributed-Memory Parallel Algori…
In generating large quantities of DNA data, high-throughput sequencing technologies require advanced bioinformatics infrastructures for efficient data analysis. k-mer counting, the process of quantifying the frequency of fixed-length k DNA…
Motivation: Building the histogram of occurrences of every $k$-symbol long substring of nucleotide data is a standard step in many bioinformatics applications, known under the name of $k$-mer counting. Its applications include developing de…
A major challenge in next-generation genome sequencing (NGS) is to assemble massive overlapping short reads that are randomly sampled from DNA fragments. To complete assembling, one needs to finish a fundamental task in many leading…
String Kernel (SK) techniques, especially those using gapped $k$-mers as features (gk), have obtained great success in classifying sequences like DNA, protein, and text. However, the state-of-the-art gk-SK runs extremely slow when we…
Background: Short sequence substrings of a fixed length k, called k-mers, are a ubiquitous computational primitive in bioinformatics, used across sequence indexing, read mapping, genome assembly, metagenomic classification, and comparative…
The k-means algorithm can simplify large-scale spatial vectors, such as 2D geo-locations and 3D point clouds, to support fast analytics and learning. However, when processing large-scale datasets, existing k-means algorithms have been…
Clustering is an important tool in data analysis, with K-means being popular for its simplicity and versatility. However, it cannot handle non-linearly separable clusters. Kernel K-means addresses this limitation but requires a large kernel…
De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is…
Maintaining a dynamic $k$-core decomposition is an important problem that identifies dense subgraphs in dynamically changing graphs. Recent work by Liu et al. [SPAA 2022] presents a parallel batch-dynamic algorithm for maintaining an…
A basic task in bioinformatics is the counting of $k$-mers in genome strings. The $k$-mer counting problem is to build a histogram of all substrings of length $k$ in a given genome sequence. We present the open source $k$-mer counting…
Clustering samples according to an effective metric and/or vector space representation is a challenging unsupervised learning task with a wide spectrum of applications. Among several clustering algorithms, k-means and its kernelized version…
Distributed approaches based on the map-reduce programming paradigm have started to be proposed in the bioinformatics domain, due to the large amount of data produced by the next-generation sequencing techniques. However, the use of…
Counting the frequencies of k-mers in read libraries is often a first step in the analysis of high-throughput sequencing experiments. Infrequent k-mers are assumed to be a result of sequencing errors. The frequent k-mers constitute a…
Upcoming many core processors are expected to employ a distributed memory architecture similar to currently available supercomputers, but parallel pattern mining algorithms amenable to the architecture are not comprehensively studied. We…
k-mers (nucleotide strings of length k) form the basis of several algorithms in computational genomics. In particular, k-mer abundance information in sequence data is useful in read error correction, parameter estimation for genome…
K-mer counting is a requisite process for DNA assembly because it speeds up its overall process. The frequency of K-mers is used for estimating the parameters of DNA assembly, error correction, etc. The process also provides a list of…
The errors occurring in DNA-based storage are correlated in nature, which is a direct consequence of the synthesis and sequencing processes. In this paper, we consider the memory-$k$ nanopore channel model recently introduced by Hamoum et…
Computing fixed-radius near-neighbor graphs is an important first step for many data analysis algorithms. Near-neighbor graphs connect points that are close under some metric, endowing point clouds with a combinatorial structure. As…
Graph-cuts are widely used in computer vision. In order to speed up the optimization process and improve the scalability for large graphs, Strandmark and Kahl introduced a splitting method to split a graph into multiple subgraphs for…
In the era of big data, k-means clustering has been widely adopted as a basic processing tool in various contexts. However, its computational cost could be prohibitively high as the data size and the cluster number are large. It is well…