Related papers: Embed-Search-Align: DNA Sequence Alignment using T…

Unaligned Sequence Similarity Search Using Deep Learning

Gene annotation has traditionally required direct comparison of DNA sequences between an unknown gene and a database of known ones using string comparison methods. However, these methods do not provide useful information when a gene does…

Machine Learning · Computer Science 2019-09-17 James K. Senter, Taylor M. Royalty, Andrew D. Steen, Amir Sadovnik

Single-Read Reconstruction for DNA Data Storage Using Transformers

As the global need for large-scale data storage is rising exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA based storage a potential…

Emerging Technologies · Computer Science 2021-10-12 Yotam Nahum, Eyar Ben-Tolila, Leon Anavy

DNA data storage, sequencing data-carrying DNA

DNA is a leading candidate as the next archival storage media due to its density, durability and sustainability. To read (and write) data DNA storage exploits technology that has been developed over decades to sequence naturally occurring…

Emerging Technologies · Computer Science 2022-05-12 Jasmine Quah, Omer Sella, Thomas Heinis

A new DNA alignment method based on inverted index

This paper presents a novel DNA sequence alignment method based on the inverted index. Most large-scale information retrieval systems now use the inverted index as their basic data structure. But its application in DNA sequence alignment is…

Genomics · Quantitative Biology 2013-07-02 Wang Liang, Zhao KaiYong

Blind Biological Sequence Denoising with Self-Supervised Set Learning

Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate…

Genomics · Quantitative Biology 2023-09-06 Nathan Ng, Ji Won Park, Jae Hyeon Lee, Ryan Lewis Kelly, Stephen Ra, Kyunghyun Cho

Small Coupling Expansion for Multiple Sequence Alignment

The alignment of biological sequences such as DNA, RNA, and proteins is one of the basic tools that allow the detection of evolutionary patterns, as well as functional/structural characterizations between homologous sequences in different…

Quantitative Methods · Quantitative Biology 2023-05-01 Louise Budzynski, Andrea Pagnani

Vector Embeddings by Sequence Similarity and Context for Improved Compression, Similarity Search, Clustering, Organization, and Manipulation of cDNA Libraries

This paper demonstrates the utility of organized numerical representations of genes in research involving flat string gene formats (i.e., FASTA/FASTQ). FASTA/FASTQ files have several current limitations, such as their large file sizes,…

Genomics · Quantitative Biology 2023-08-11 Daniel H. Um, David A. Knowles, Gail E. Kaiser

Error-Correcting Codes for Labeled DNA Sequences

Labeling of DNA molecules is a fundamental technique for DNA visualization and analysis. This process was mathematically modeled in [1], where the received sequence indicates the positions of the used labels. In this work, we develop error…

Information Theory · Computer Science 2025-11-04 Dganit Hanania, Eitan Yaakobi

Align then Train: Efficient Retrieval Adapter Learning

Dense retrieval systems increasingly need to handle complex queries. In many realistic settings, users express intent through long instructions or task-specific descriptions, while target documents remain relatively simple and static. This…

Information Retrieval · Computer Science 2026-04-07 Seiji Maekawa, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka

Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM

Summary: BWA-MEM is a new alignment algorithm for aligning sequence reads or long query sequences against a large reference genome such as human. It automatically chooses between local and end-to-end alignments, supports paired-end reads…

Genomics · Quantitative Biology 2013-05-28 Heng Li

Fixed-Length Protein Embeddings using Contextual Lenses

The Basic Local Alignment Search Tool (BLAST) is currently the most popular method for searching databases of biological sequences. BLAST compares sequences via similarity defined by a weighted edit distance, which results in it being…

Biomolecules · Quantitative Biology 2020-10-29 Amir Shanehsazzadeh, David Belanger, David Dohan

New Sequence Alignment Algorithm using AI Rules and Dynamic Seeds

DNA sequence alignment is important today as it is usually the first step in finding gene mutation, evolutionary similarities, protein structure, drug development and cancer treatment. Covid-19 is one recent example. There are many…

Genomics · Quantitative Biology 2023-06-01 Suchindra, Preetam Nagaraj

Large-scale Machine Learning for Metagenomics Sequence Classification

Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is…

Quantitative Methods · Quantitative Biology 2015-05-27 Kévin Vervier, Pierre Mahé, Maud Tournoud, Jean-Baptiste Veyrieras, Jean-Philippe Vert

FPGA Acceleration of Sequence Alignment: A Survey

Genomics is changing our understanding of humans, evolution, diseases, and medicines to name but a few. As sequencing technology is developed collecting DNA sequences takes less time thereby generating more genetic data every day. Today the…

Quantitative Methods · Quantitative Biology 2020-07-29 Sahand Salamat, Tajana Rosing

Lerna: Transformer Architectures for Configuring Error Correction Tools for Short- and Long-Read Genome Sequencing

Sequencing technologies are prone to errors, making error correction (EC) necessary for downstream applications. EC tools need to be manually configured for optimal performance. We find that the optimal parameters (e.g., k-mer size) are…

Genomics · Quantitative Biology 2021-12-21 Atul Sharma, Pranjal Jain, Ashraf Mahgoub, Zihan Zhou, Kanak Mahadik, Somali Chaterji

Integrating alignment-based and alignment-free sequence similarity measures for biological sequence classification

Alignment-based sequence similarity searches, while accurate for some type of sequences, can produce incorrect results when used on more divergent but functionally related sequences that have undergone the sequence rearrangements observed…

Genomics · Quantitative Biology 2015-01-21 Ivan Borozan, Stuart Watt, Vincent Ferretti

BEND: Benchmarking DNA Language Models on biologically meaningful tasks

The genome sequence contains the blueprint for governing cellular processes. While the availability of genomes has vastly increased over the last decades, experimental annotation of the various functional, non-coding and regulatory elements…

Genomics · Quantitative Biology 2024-04-10 Frederikke Isa Marin, Felix Teufel, Marc Horlacher, Dennis Madsen, Dennis Pultz, Ole Winther, Wouter Boomsma

Iterative Learning for Reference-Guided DNA Sequence Assembly from Short Reads: Algorithms and Limits of Performance

Recent emergence of next-generation DNA sequencing technology has enabled acquisition of genetic information at unprecedented scales. In order to determine the genetic blueprint of an organism, sequencing platforms typically employ…

Genomics · Quantitative Biology 2015-06-19 Xiaohu Shen, Manohar Shamaiah, Haris Vikalo

Learning protein sequence embeddings using information from structure

Inferring the structural properties of a protein from its amino acid sequence is a challenging yet important problem in biology. Structures are not known for the vast majority of protein sequences, but structure is critical for…

Machine Learning · Computer Science 2019-10-17 Tristan Bepler, Bonnie Berger

In Search of Lost DNA Sequence Pretraining

DNA sequence encoding is fundamental to gene function prediction, protein synthesis, and diverse downstream biological tasks. Despite the substantial progress achieved by large-scale DNA sequence pretraining, existing studies have…

Machine Learning · Computer Science 2026-04-21 Zhijiang Tang, Jiaxin Qi, Yan Cui, Jinli Ou, Yuhua Zheng, Jianqiang Huang