Inferring phylogenies from genomic data
Besides being of fundamental scientific interest, the phylogeny of infectious pathogens is important for understanding, and thus ultimately controlling, the spread of epidemics. This matters all the more in light of recurrent outbreaks (e.g. MERS, SARS, COVID-19). The accessibility of sequencing technologies means that phylogenies can be rooted in genomic data, often at near real-time speeds.
Overview
To grossly oversimplify things, the basic steps of inferring a phylogeny are:
- Obtain and format sequence data
- Perform alignment [shell code]
- Infer a phylogenetic tree [R code]
- Assign time scales
Working with sequence data
In 2020, getting sequence data is very straightforward. There are a number of general sources like NCBI or modENCODE, and more specific ones like ViPR (now part of BV-BRC), which focuses on viruses. Handling sequence data is well supported in both R (through Bioconductor’s Biostrings, among others) and Python (through Biopython’s Seq class).
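To give a feel for what "handling sequence data" involves, here is a minimal sketch of reading FASTA records using only the Python standard library. It is not how Biostrings or Biopython work internally; the function names `parse_fasta` and `gc_content` and the toy records are made up for illustration, and real projects would likely rely on one of the libraries above instead.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {record_id: sequence} dict."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the ID is the first whitespace-delimited token.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append to the current record, in uppercase.
            records[current_id] += line.upper()
    return records


def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases."""
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        return 0.0
    return sum(base in "GC" for base in counted) / len(counted)


# Hypothetical toy records, for illustration only.
example = """>sample_A made-up description
ACGTAC
GTAA
>sample_B
ggcc
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq), gc_content(seq))
```

Sequences that wrap over several lines are concatenated, and lowercase input is normalized, which are the two details that most often trip up hand-rolled parsers.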
Perform sequence alignment
Whether your data come from different species or from multiple samplings of the same rapidly evolving species, the first step in inferring phylogenies is to perform a multiple sequence alignment (MSA). Commonly used tools include Clustal, MAFFT, parsnp, minimap and MUMmer.
Infer phylogenetic trees
Once you have obtained the MSA, you can construct a phylogenetic tree with one of several tools: PHYLIP, IQ-TREE, RAxML, or FastTree. You will need to choose a substitution model to represent how the sequences evolved in your organism(s), though some tools like IQ-TREE can also select one for you as part of tree construction. For more on substitution models, check out Paul Lewis’ excellent primer seminars at the equally excellent PhyloSeminar repository of talks.
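As a concrete illustration of what a substitution model does, the sketch below applies the Jukes-Cantor (JC69) correction, one of the simplest models, to the raw proportion of differing sites between two aligned sequences. It is only a toy: the function names are made up for this example, and the tree-building tools above use far richer models and likelihood machinery.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or a non-ACGT symbol are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return mismatches / compared


def jc69_distance(p: float) -> float:
    """Jukes-Cantor corrected distance: d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Hypothetical aligned toy sequences, for illustration only.
p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(p, jc69_distance(p))
```

The corrected distance is always at least as large as the raw proportion, because the model accounts for multiple substitutions hitting the same site.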
At this stage, it’s good to familiarize yourself with the relevant file formats like NEXUS, PHYLIP and Newick. In R, trees can be stored as phylo S3 objects, supported by many phylogenetic packages (e.g. ape, phylobase). More recently, treeio implements the S4 treedata class, which interoperates with the tidytree package, not to be confused with TidyTree, a JavaScript tree visualization tool from the CDC.
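Newick is the format you will run into most often, so it helps to see how little there is to it. The following is a minimal sketch of a parser for simple Newick strings, assuming no quoted labels, comments, or other extensions; the `Node` class and helper names are hypothetical and unrelated to the R classes mentioned above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    name: str = ""
    length: Optional[float] = None  # branch length to the parent, if given
    children: List["Node"] = field(default_factory=list)


def parse_newick(text: str) -> Node:
    """Parse a simple Newick string (no quoted labels or comments)."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("Newick string must end with ';'")
    pos = 0

    def parse_node() -> Node:
        nonlocal pos
        node = Node()
        if text[pos] == "(":
            pos += 1  # consume '('
            while True:
                node.children.append(parse_node())
                if text[pos] == ",":
                    pos += 1  # another sibling follows
                elif text[pos] == ")":
                    pos += 1  # end of this clade
                    break
                else:
                    raise ValueError(f"unexpected character at position {pos}")
        # Optional label, then optional ':branch_length'.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        node.name = text[start:pos].strip()
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            node.length = float(text[start:pos])
        return node

    root = parse_node()
    if pos != len(text) - 1:
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node: Node) -> List[str]:
    """Leaf labels in left-to-right order."""
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names


def root_to_tip(node: Node, so_far: float = 0.0) -> Dict[str, float]:
    """Summed branch length from the root to each leaf."""
    total = so_far + (node.length or 0.0)
    if not node.children:
        return {node.name: total}
    distances: Dict[str, float] = {}
    for child in node.children:
        distances.update(root_to_tip(child, total))
    return distances


# Hypothetical toy tree, for illustration only.
tree = parse_newick("((A:0.1,B:0.2):0.05,C:0.3);")
print(leaf_names(tree))
print(root_to_tip(tree))
```

Parentheses group sibling clades, commas separate them, and a colon introduces the branch length, which is all this parser needs to handle.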
Phylodynamic modeling
Assigning time scales to branching points, or phylodynamic modeling, can be done using TreeTime, BEAST 2 or LSD. These tools operate on a number of file types; TreeTime, for example, works on FASTA, PHYLIP, and VCF.
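To build some intuition for how sampling dates turn branch lengths into time, here is a minimal sketch of a root-to-tip regression: an ordinary least-squares fit of each tip's distance from the root against its sampling date. This is a common first check for temporal signal, not a reimplementation of any of the tools above, which use considerably more elaborate methods. The function name and the numbers are made up for illustration.

```python
from typing import Sequence, Tuple


def root_to_tip_regression(
    dates: Sequence[float], distances: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares fit of distance = rate * (date - t_root).

    Returns (rate, t_root): the slope in substitutions per site per year,
    and the date at which the fitted line crosses zero distance.
    """
    n = len(dates)
    if n != len(distances) or n < 2:
        raise ValueError("need at least two (date, distance) pairs")
    mean_t = sum(dates) / n
    mean_d = sum(distances) / n
    sxx = sum((t - mean_t) ** 2 for t in dates)
    if sxx == 0:
        raise ValueError("all sampling dates are identical")
    sxy = sum((t - mean_t) * (d - mean_d) for t, d in zip(dates, distances))
    rate = sxy / sxx
    if rate <= 0:
        raise ValueError("no positive temporal signal in the data")
    intercept = mean_d - rate * mean_t
    t_root = -intercept / rate
    return rate, t_root


# Hypothetical toy data: decimal sampling years and root-to-tip distances.
dates = [2014.0, 2016.0, 2018.0, 2020.0]
distances = [0.004, 0.006, 0.008, 0.010]
rate, t_root = root_to_tip_regression(dates, distances)
print(rate, t_root)
```

A slope that is not clearly positive suggests the samples do not span enough time to calibrate a clock, in which case dated trees from any tool deserve extra skepticism.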
Visualization
For visualizing and annotating trees with metadata, the treeio package readily works with ggtree, and the interoperability is very well documented by their author.
Acknowledgements
A fairly comprehensive reference is The Phylogenetic Handbook. Also, much of the information in this blog post wouldn’t be possible without the great bodies of work by my former FredHutch colleagues Erick Matsen and Trevor Bedford. Lastly, my TwinStrand colleague Dan Sommer also gave considerable useful advice.