Related papers: AnySeq: A High Performance Sequence Alignment Libr…
In recent years, the rapidly increasing number of reads produced by next-generation sequencing (NGS) technologies has driven the demand for efficient implementations of sequence alignments in bioinformatics. However, current…
The use of deep learning models in computational biology has increased massively in recent years, and it is expected to continue with the current advances in the fields such as Natural Language Processing. These models, although able to…
State-of-the-art multiple sequence alignment (MSA) algorithms are based on progressive approaches that rely on pairwise sequence alignment (PSA) to generate guide trees to align all sequences. Given an evidenced explosion in genomic data…
Recent advances in computational methods for designing biological sequences have sparked the development of metrics to evaluate these methods performance in terms of the fidelity of the designed sequences to a target distribution and their…
PyamilySeq is a Python-based tool designed for interpretable gene clustering and pangenomic inference, supporting analyses at both species and genus levels. It facilitates the clustering of gene sequences into families based on sequence…
fairseq is an open-source sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks. The toolkit is based on PyTorch and…
Motivation: Recent advances in sequencing technologies promise ultra-long reads of $\sim$100 kilo bases (kb) in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 mega bases (Mb) in length. Existing…
Past work in natural language processing interpretability focused mainly on popular classification tasks while largely overlooking generation settings, partly due to a lack of dedicated tools. In this work, we introduce Inseq, a Python…
Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. Among the most important computational problems is that of sequence alignment, i.e. arranging sequences from different organisms in…
Genome sequences contain hundreds of millions of DNA base pairs. Finding the degree of similarity between two genomes requires executing a compute-intensive dynamic programming algorithm, such as Smith-Waterman. Traditional von Neumann…
DNA sequence alignment is important today as it is usually the first step in finding gene mutation, evolutionary similarities, protein structure, drug development and cancer treatment. Covid-19 is one recent example. There are many…
Designing novel protein sequences for a desired 3D topological fold is a fundamental yet non-trivial task in protein engineering. Challenges exist due to the complex sequence--fold relationship, as well as the difficulties to capture the…
Ancestral sequence reconstruction is a key task in computational biology. It consists in inferring a molecular sequence at an ancestral species of a known phylogeny, given descendant sequences at the tip of the tree. In addition to its many…
Motivation: The mapping of RNA-seq reads to their transcripts of origin is a fundamental task in transcript expression estimation and differential expression scoring. Where ambiguities in mapping exist due to transcripts sharing sequence,…
Evolutionary sparse learning (ESL) uses a supervised machine learning approach, Least Absolute Shrinkage and Selection Operator (LASSO), to build models explaining the relationship between a hypothesis and the variation across genomic…
Summary: Accurate phenotype prediction from genomic sequences is a highly coveted task in biological and medical research. While machine-learning holds the key to accurate prediction in a variety of fields, the complexity of biological data…
High throughput sequencing of RNA (RNA-Seq) can provide us with millions of short fragments of RNA transcripts from a sample. How to better recover the original RNA transcripts from those fragments (RNA-Seq assembly) is still a difficult…
Artificial intelligence (AI) tools are gaining more and more ground each year in bioinformatics. Learning algorithms can be taught easily by using the existing enormous biological databases, and the resulting models can be used for the…
Genome sequence analysis plays a pivotal role in enabling many medical and scientific advancements in personalized medicine, outbreak tracing, and forensics. However, the analysis of genome sequencing data is currently bottlenecked by the…
Computational complexity is a key limitation of genomic analyses. Thus, over the last 30 years, researchers have proposed numerous fast heuristic methods that provide computational relief. Comparing genomic sequences is one of the most…