Workflow managers for bioinformatic pipelines

Due to the growth of data, workflow managers (e.g., Airflow, Prefect) have been growing in popularity. In bioinformatics, this popularity is further accelerated by the replication crisis in science: workflow managers automate routine tasks while also ensuring reproducibility by enabling drop-in changes to data, runtime parameters, or even entire toolchains. At Fred Hutch, where I used to work, Nextflow and Cromwell were the most popular. Elsewhere, Snakemake is also popular, though I don't have much personal experience with it.

Nextflow

Nextflow is based on an extended version of Groovy and is now on the second version of its workflow language (DSL2). My former colleague Sam Minot set up the FredHutch Nextflow demos, leveraging his background in viral metagenomics.

Simple tasks like batch-converting PNGs into JPEGs only took me a few minutes; porting a simple DataCarpentry genomics workflow to Nextflow DSL2 is a weekend project.
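To give a rough idea of what such a batch-conversion script looks like, here is a minimal DSL2 sketch. The directory names and the use of ImageMagick's convert are assumptions for illustration, not the exact workflow I wrote:

```nextflow
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

// Hypothetical input/output locations; override on the command line if needed.
params.input_dir = 'images'
params.outdir    = 'jpegs'

// Convert one PNG to JPEG with ImageMagick's `convert`,
// which is assumed to be available on the execution host (or in a container).
process png_to_jpeg {
    publishDir params.outdir, mode: 'copy'

    input:
    path png

    output:
    path "${png.baseName}.jpg"

    script:
    """
    convert ${png} ${png.baseName}.jpg
    """
}

workflow {
    // One task is spawned per PNG found in the input directory.
    Channel.fromPath("${params.input_dir}/*.png") | png_to_jpeg
}
```

Saved as, say, convert.nf, this could be run with `nextflow run convert.nf --input_dir path/to/pngs`; Nextflow takes care of launching one task per file and copying the results into the output directory.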

Cromwell

Cromwell requires a bit more infrastructure. Workflows are written in the Workflow Description Language (WDL), but the syntax is also pretty simple and there is a linting tool for troubleshooting. Running Cromwell requires a database backend to store jobs, and you interact with the server through a Swagger REST API. As a result, the job records and logs are fairly detailed, which aids troubleshooting when jobs fail. My colleague Amy Paguirigan has done a lot of work to support Cromwell at FredHutch, developing wrapper code and relevant documentation. She also put together some workflows to perform variant calling on RNA-seq data, and pointed us to the Broad Institute's large repository of GATK-based workflows.
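To give a feel for WDL, here is a toy task and workflow that counts the reads in a FASTQ file. This is only a minimal sketch of the syntax, not one of the FredHutch or Broad workflows mentioned above:

```wdl
version 1.0

# Toy task: count reads in a (possibly gzipped) FASTQ file.
task count_reads {
  input {
    File fastq
  }
  command <<<
    zcat -f ~{fastq} | wc -l | awk '{print $1/4}'
  >>>
  output {
    Float n_reads = read_float(stdout())
  }
  runtime {
    docker: "ubuntu:20.04"
  }
}

workflow count_reads_wf {
  input {
    File fastq
  }
  call count_reads { input: fastq = fastq }
  output {
    Float n_reads = count_reads.n_reads
  }
}
```

A file like this, together with an inputs JSON, can then be POSTed to a running Cromwell server's REST endpoint (/api/workflows/v1) for execution.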

Written on February 18, 2020