Related papers: Minimum error correction-based haplotype assembly:…
We present several new results pertaining to haplotyping. These results concern the combinatorial problem of reconstructing haplotypes from incomplete and/or imperfectly sequenced haplotype fragments. We consider the complexity of the…
The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the…
In this paper we present a collection of results pertaining to haplotyping. The first set of results concerns the combinatorial problem of reconstructing haplotypes from incomplete and/or imperfectly sequenced haplotype data. More…
This paper studies the haplotype assembly problem from an information theoretic perspective. A haplotype is a sequence of nucleotide bases on a chromosome, often conveniently represented by a binary string, that differ from the bases in the…
Recent matrix completion based methods have not been able to properly model the Haplotype Assembly Problem (HAP) for noisy observations. To cope with such a case, in this letter we propose a new Minimum Error Correction (MEC) based matrix…
Background: Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual's susceptibility to hereditary and complex diseases and affect how our…
Humans have $23$ pairs of homologous chromosomes. The homologous pairs are almost identical pairs of chromosomes. For the most part, differences in homologous chromosome occur at certain documented positions called single nucleotide…
DNA samples are often pooled, either by experimental design, or because the sample itself is a mixture. For example, when population allele frequencies are of primary interest, individual samples may be pooled together to lower the cost of…
Multimodal data is a precious asset enabling a variety of downstream tasks in machine learning. However, real-world data collected across different modalities is often not paired, which is a significant challenge to learn a joint…
High read depth can be used to assemble short sequence repeats. The existing genome assemblers fail in repetitive regions of longer than average read. I propose a new algorithm for a DNA assembly which uses the relative frequency of reads…
Chromosomal DNA is characterized by variation between individuals at the level of entire chromosomes (e.g., aneuploidy in which the chromosome copy number is altered), segmental changes (including insertions, deletions, inversions, and…
A new optimization design is proposed for matrix completion by weighting the measurements and deriving the corresponding error bound. Accordingly, the Haplotype reconstruction using nuclear norm minimization with Weighted Constraint…
Short-read DNA sequencing instruments can yield over 1e+12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences using…
Self-supervised pre-training methods have brought remarkable breakthroughs in the understanding of text, image, and speech. Recent developments in genomics has also adopted these pre-training methods for genome understanding. However, they…
Various studies have shown an association between single nucleotide polymorphisms (SNPs) and common disease. We hypothesize that information encoded in the structure of SNP haploblock variation illumines molecular pathways and cellular…
We consider the problem Minimum Error Correction (MEC). A MEC instance is an n x m matrix M with entries from {0,1,-}. Feasible solutions are composed of two binary m-bit strings, together with an assignment of each row of M to one of the…
The Minimum-Recombinant Haplotype Configuration problem (MRHC) has been highly successful in providing a sound combinatorial formulation for the important problem of genotype phasing on pedigrees. Despite several algorithmic advances and…
Owing to its several merits over other DNA sequencing technologies, nanopore sequencers hold an immense potential to revolutionize the efficiency of DNA storage systems. However, their higher error rates necessitate further research to…
We describe an algorithm to find maximal exact matches (MEMs) among HiFi reads with homopolymer errors. The main novelty in our work is that we resort to run-length compression to help deal with errors. Our method receives as input a…
The application of standard sufficient dimension reduction methods for reducing the dimension space of predictors without losing regression information requires inverting the covariance matrix of the predictors. This has posed a number of…