Phylo2Vec: a vector representation for binary trees

Populations and Evolution 2025-03-26 v5 Machine Learning Quantitative Methods

Abstract

Binary phylogenetic trees inferred from biological data are central to understanding the shared history among evolutionary units. However, inferring the placement of latent nodes in a tree is computationally expensive. State-of-the-art methods rely on carefully designed heuristics for tree search, using different data structures for easy manipulation (e.g., classes in object-oriented programming languages) and readable representation of trees (e.g., Newick-format strings). Here, we present Phylo2Vec, a parsimonious encoding for phylogenetic trees that serves as a unified approach for both manipulating and representing phylogenetic trees. Phylo2Vec maps any binary tree with $n$ leaves to a unique integer vector of length $n-1$. The advantages of Phylo2Vec are fourfold: (i) fast tree sampling, (ii) compressed tree representation compared to a Newick string, (iii) quick and unambiguous verification of whether two binary trees are topologically identical, and (iv) systematic ability to traverse tree space in very large or small jumps. As a proof of concept, we use Phylo2Vec for maximum likelihood inference on five real-world datasets and show that a simple hill-climbing-based optimisation scheme can efficiently traverse the vastness of tree space from a random to an optimal tree.
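To make the vector representation concrete, here is a minimal Python sketch of sampling and comparing such vectors. It rests on an assumption that the abstract does not state: that the entry at 0-based position j ranges over {0, ..., 2j}. Under that assumption the number of valid vectors of length n-1 is 1 x 3 x ... x (2n-3) = (2n-3)!!, which is the known number of rooted binary trees on n labelled leaves, so the constraint is at least consistent with a one-to-one mapping. All function names are hypothetical, and this is not the authors' implementation.

```python
import random
from math import prod


def sample_vector(n_leaves, rng=None):
    """Sample an integer vector of length n_leaves - 1.

    ASSUMPTION (not stated in the abstract): entry j (0-based) is drawn
    uniformly from {0, ..., 2j}.
    """
    rng = rng or random.Random()
    return [rng.randint(0, 2 * j) for j in range(n_leaves - 1)]


def is_valid(v):
    """Check the assumed constraint 0 <= v[j] <= 2j for every position j."""
    return all(isinstance(x, int) and 0 <= x <= 2 * j for j, x in enumerate(v))


def count_vectors(n_leaves):
    """Number of vectors satisfying the assumed constraint."""
    return prod(2 * j + 1 for j in range(n_leaves - 1))


def count_rooted_binary_trees(n_leaves):
    """(2n - 3)!!: number of rooted binary trees on n labelled leaves."""
    result = 1
    for k in range(1, 2 * n_leaves - 2, 2):
        result *= k
    return result


def same_topology(v1, v2):
    """If the encoding is unique, comparing trees reduces to comparing vectors."""
    return list(v1) == list(v2)
```

Sampling a tree in this sketch costs one bounded random draw per vector entry, and checking topological identity is a plain element-wise comparison, which is the kind of saving the abstract describes in advantages (i) and (iii).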

Cite

@article{arxiv.2304.12693,
  title  = {Phylo2Vec: a vector representation for binary trees},
  author = {Matthew J Penn and Neil Scheidwasser and Mark P Khurana and David A Duchêne and Christl A Donnelly and Samir Bhatt},
  journal= {arXiv preprint arXiv:2304.12693},
  year   = {2025}
}

Comments

38 pages, 9 figures, 1 table, 2 supplementary figures