Inferring phylogenies from genomic data

Besides being of fundamental scientific interest, the phylogeny of infectious pathogens is important for understanding, and thus ultimately controlling, the spread of epidemics. This matters all the more in light of recurrent outbreaks (e.g. MERS, SARS, COVID-19). The accessibility of sequencing technologies means that phylogenies can be rooted in genomic data, often at near real-time speeds.

Overview

To grossly oversimplify things, the basic steps of inferring a phylogeny are:

  1. Obtain and format sequence data

  2. Perform alignment [shell code]

  3. Infer a phylogenetic tree [R code]

  4. Assign time scales

Working with sequence data

In 2020, getting sequence data is very straightforward. There are a number of general sources like NCBI or modENCODE, and more specific ones like ViPR (now part of BV-BRC), which focuses on viruses. Handling sequence data is well supported in both R (through Bioconductor’s Biostrings, among others) and Python (through Biopython’s Seq class).
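To give a rough feel for what handling sequence data involves, here is a minimal sketch of a FASTA reader in plain Python. The function name read_fasta and the toy records are made up for illustration, and the sketch ignores edge cases that a real library such as Biopython or Biostrings takes care of.

```python
import io


def read_fasta(handle):
    """Parse FASTA text from a file-like object into {record_id: sequence}."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped, so collect the pieces per record.
            records[current_id].append(line.upper())
    return {rid: "".join(chunks) for rid, chunks in records.items()}


# Toy input; the IDs and sequences are invented for illustration.
example = io.StringIO(">sample_1 toy record\nACGTAC\nGTTA\n>sample_2\nacgtacgtca\n")
sequences = read_fasta(example)
for record_id, seq in sequences.items():
    print(record_id, len(seq), seq)
```

For real work, a maintained parser is the safer choice; the point here is only that a FASTA file is a series of header lines followed by (possibly wrapped) sequence lines.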

Perform sequence alignment

Whether your data come from different species or from multiple samples of the same rapidly evolving species, the first step in inferring a phylogeny is to perform a multiple sequence alignment (MSA). Commonly used tools include Clustal, MAFFT, Parsnp, minimap and MUMmer.
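As one possible way to run this step, the sketch below wraps MAFFT from Python. It assumes MAFFT is installed and on your PATH; the helper names mafft_command and run_mafft are hypothetical. The --auto option asks MAFFT to pick an alignment strategy based on the size of the data.

```python
import shutil
import subprocess


def mafft_command(input_fasta):
    """Build the argument list for a MAFFT run with automatic strategy selection."""
    return ["mafft", "--auto", input_fasta]


def run_mafft(input_fasta, output_fasta):
    """Align input_fasta with MAFFT and write the alignment to output_fasta."""
    if shutil.which("mafft") is None:
        raise RuntimeError("mafft was not found on PATH")
    # MAFFT writes the alignment to standard output, so redirect it to a file.
    with open(output_fasta, "w") as out:
        subprocess.run(mafft_command(input_fasta), stdout=out, check=True)
```

Calling run_mafft("sequences.fasta", "aligned.fasta"), with placeholder file names, would produce an aligned FASTA file ready for tree inference.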

Infer phylogenetic trees

Once you have obtained the MSA, you can construct a phylogenetic tree with one of several tools: PHYLIP, IQ-TREE, RAxML, or FastTree. You will need to choose a substitution model to represent how the sequences evolved in your organism(s), though some tools like IQ-TREE can also select one for you as part of tree construction. For more on substitution models, check out Paul Lewis’ excellent primer seminars at the equally excellent PhyloSeminar repository of talks.
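To make the idea of a substitution model a little more concrete, here is a small sketch using the simplest one, Jukes-Cantor (JC69), which assumes all substitutions are equally likely. It corrects the observed proportion of differing sites p for multiple hits at the same site via d = -3/4 ln(1 - 4p/3). The function names and the two toy rows are invented for illustration, and this is not the model a tool like IQ-TREE would necessarily pick for real data.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned rows, skipping gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # ignore columns where either row has a gap
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no ungapped sites to compare")
    return differing / compared


def jc69_distance(p):
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("JC69 distance is undefined for p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Two toy rows from an alignment, differing at 2 of 10 sites.
row_a = "ACGTACGTAC"
row_b = "ACGTTCGTAA"
p = p_distance(row_a, row_b)
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as the raw proportion, since some substitutions are hidden by later changes at the same site.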

At this stage, it’s good to familiarize yourself with the relevant file formats like NEXUS, PHYLIP and Newick. In R, trees can be stored as phylo S3 objects, supported by many phylogenetic packages (e.g. ape, phylobase). More recently, treeio implements the S4 treedata class, which interoperates with the tidytree package, not to be confused with TidyTree, a JavaScript tree visualization tool from the CDC.
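Newick is the most compact of these formats: nested parentheses describe the branching, and an optional number after a colon gives a branch length. As a quick illustration, the sketch below pulls leaf names out of a Newick string with a regular expression. The example tree is made up, and the sketch assumes unquoted labels and no bracketed comments, so it is no substitute for a real parser.

```python
import re

# Leaf labels are the names that directly follow "(" or ","; labels that
# follow ")" belong to internal nodes and are skipped.
LEAF_PATTERN = re.compile(r"(?<=[(,])\s*([^\s():,;]+)")


def newick_leaf_names(newick):
    """Return leaf labels in the order they appear in the Newick string."""
    return LEAF_PATTERN.findall(newick)


# Toy tree: four leaves, one labeled internal node (AB), branch lengths after ":".
tree = "((A:0.1,B:0.2)AB:0.05,(C:0.3,D:0.4):0.1);"
print(newick_leaf_names(tree))
```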

Phylodynamic modeling

Assigning time scales to branching points, or phylodynamic modeling, can be done using TreeTime, BEAST 2 or LSD. These tools operate on a number of file types; TreeTime, for example, works with FASTA, PHYLIP, and VCF.
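The tools above use more sophisticated methods, but a simple way to see the underlying idea is root-to-tip regression: under a strict molecular clock, the distance from the root to each tip should grow linearly with the tip's sampling date. The slope then estimates the substitution rate, and the date at which the fitted line crosses zero estimates the age of the root. The sketch below is an illustration under that assumption, with invented data and a hypothetical function name.

```python
def root_to_tip_regression(dates, distances):
    """Least-squares fit of root-to-tip distance against sampling date.

    Returns (rate, root_date): the slope in substitutions per site per unit
    time, and the date at which the fitted line reaches zero distance.
    """
    n = len(dates)
    if n < 2 or n != len(distances):
        raise ValueError("need at least two tips and matching list lengths")
    mean_x = sum(dates) / n
    mean_y = sum(distances) / n
    sxx = sum((x - mean_x) ** 2 for x in dates)
    if sxx == 0:
        raise ValueError("sampling dates must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dates, distances))
    rate = sxy / sxx
    if rate <= 0:
        raise ValueError("no positive clock signal in these data")
    # The fitted line passes through (mean_x, mean_y); walk back to distance 0.
    root_date = mean_x - mean_y / rate
    return rate, root_date


# Invented data: decimal sampling dates and root-to-tip distances.
sampling_dates = [2020.0, 2020.25, 2020.5, 2020.75]
root_to_tip = [0.0010, 0.0013, 0.0015, 0.0018]
rate, root_date = root_to_tip_regression(sampling_dates, root_to_tip)
print(rate, root_date)
```

A poor linear fit in such a plot is a useful warning sign that the data carry little temporal signal, or that the tree is rooted in the wrong place.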

Visualization

For visualizing and annotating trees with metadata, the treeio package readily works with ggtree, and the interoperability is very well documented by their author.

Acknowledgements

A fairly comprehensive reference is The Phylogenetic Handbook. Also, much of the information in this blog post wouldn’t be possible without the great bodies of work by my former FredHutch colleagues Erick Matsen and Trevor Bedford. Lastly, my TwinStrand colleague Dan Sommer also gave considerable useful advice.

Written on April 11, 2020