Related papers: An Improved Filtering Algorithm for Big Read Datas…
In this paper, we propose a semi-supervised deep learning method for detecting the specific types of reads that impede the de novo genome assembly process. Instead of dealing directly with sequenced reads, we analyze their coverage graphs…
Adequate read filtering is critical when processing high-throughput data in marker-gene-based studies. Sequencing errors can cause the mis-clustering of otherwise similar reads, artificially increasing the number of retrieved Operational…
Metagenomic studies have increasingly utilized sequencing technologies in order to analyze DNA fragments found in environmental samples.One important step in this analysis is the taxonomic classification of the DNA fragments. Conventional…
Motivation: Next generation methods of DNA sequencing produce relatively high rate of reading errors, which interfere with de novo genome assembly of newly sequenced organisms and particularly affect the quality of SNP detection important…
Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is…
As the global need for large-scale data storage is rising exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA based storage a potential…
Motivation: Seed filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. Read mappers 1) quickly…
Motivation: Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read…
High read depth can be used to assemble short sequence repeats. The existing genome assemblers fail in repetitive regions of longer than average read. I propose a new algorithm for a DNA assembly which uses the relative frequency of reads…
DNA has immense potential as an emerging data storage medium. The principle of DNA storage is the conversion and flow of digital information between binary code stream, quaternary base, and actual DNA fragments. This process will inevitably…
Next-generation sequencing technologies generate millions of short sequence reads, which are usually aligned to a reference genome. In many applications, the key information required for downstream analysis is the number of reads mapping to…
While most current high-throughput DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies generate long reads with high error rates. A basic question of interest is the tradeoff between read…
Motivation: With the development of third-generation sequencing technologies, people are able to obtain DNA sequences with lengths from 10s to 100s of kb. These long reads allow protein domain annotation without assembly, thus can produce…
The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing has rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck.…
The large volumes of sequencing data required to sample complex environments deeply pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires…
The extraction of $k$-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all…
This study proposes a data condensation method for multivariate kernel density estimation by genetic algorithm. First, our proposed algorithm generates multiple subsamples of a given size with replacement from the original sample. The…
When selecting data for training large-scale models, standard practice is to filter for examples that match human notions of data quality. Such filtering yields qualitatively clean datapoints that intuitively should improve model behavior.…
Deep learning's success has been attributed to the training of large, overparameterized models on massive amounts of data. As this trend continues, model training has become prohibitively costly, requiring access to powerful computing…
The prevalent technique for DNA sequencing consists of two main steps: shotgun sequencing, where many randomly located fragments, called reads, are extracted from the overall sequence, followed by an assembly algorithm that aims to…