Command line tools for data cleaning and analysis

Having been a UNIX user for a while, I know of the wonderful things built out of command line utilities, like one-liners in awk, sed, or even bash itself. I also know of Erick Matson’s excellent guide to “second-generation” shell tools (tmux, fd, ag, etc.). The beauty of command line tools is that you can chain them together, from the humble cut and grep to the newer verticalize, and you can wrap your code in expect to handle interactivity.
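
To make the chaining idea concrete, here is a minimal sketch that uses only classic tools (grep, cut, sort, uniq, awk). The sample data and the function names sample_data and species_counts are made up for illustration and do not come from any of the guides mentioned above.

```shell
# Hypothetical sample data (id,species,reads), invented for this example.
sample_data() {
  printf '%s\n' \
    'id,species,reads' \
    's1,human,100' \
    's2,mouse,250' \
    's3,human,75' \
    's4,yeast,30' \
    's5,human,90'
}

# Count samples per species, most frequent first.
species_counts() {
  sample_data |
    grep -v '^id,' |          # drop the header row
    cut -d, -f2 |             # keep only the species column
    sort |                    # uniq needs sorted input
    uniq -c |                 # count identical adjacent lines
    sort -k1,1nr -k2,2 |      # highest count first, ties broken by name
    awk '{print $2, $1}'      # print "species count"
}

species_counts
```

Each stage does one small job and hands plain text to the next, so any step can be swapped out or inspected on its own by cutting the pipeline short at that point.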

Bioinformatics and dealing with big data

Working in bioinformatics, I have also come across a whole cottage industry of bioinformatics one-liners using bcftools and samtools (and their refined variants, e.g. sambamba). For FASTQ files there is the excellent fastp. To glue these tools together, there are workflow managers, as I have described in a previous post.

There are also bioinformatics variants of classic UNIX tools, like seqtk and bioawk. When dealing with big data, it’s good to parallelize your tasks, using both a general-purpose tool like GNU Parallel and parallelized versions of existing tools, like pigz. Often, these tools are discovered the hard way, as documented in this wonderfully informative blog post about analyzing 25 TB of sequencing data.
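
As a rough sketch of the general-purpose approach, the function below compresses many files at once, one gzip job per file. The name compress_all, the .txt pattern, and the choice of 4 jobs in the fallback are assumptions for illustration. It uses GNU Parallel when that is installed and otherwise falls back to xargs -P, which is a different tool with similar effect. pigz takes the complementary route of using several threads on a single file, which is not shown here.

```shell
# compress_all DIR: gzip every .txt file in DIR, several files at a time.
# Hypothetical helper; assumes DIR contains at least one matching file.
compress_all() {
  dir=$1
  if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # GNU Parallel: NUL-separated input, one gzip job per file
    find "$dir" -name '*.txt' -print0 | parallel -0 gzip -f {}
  else
    # Fallback (not GNU Parallel): run up to 4 gzip processes at once
    find "$dir" -name '*.txt' -print0 | xargs -0 -P 4 -n 1 gzip -f
  fi
}

# Example call (the path is a placeholder):
# compress_all ./my_data_dir
```

Passing file names NUL-separated (-print0 with -0) keeps names containing spaces or newlines intact, which matters once the file list is generated rather than typed by hand.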

Data Science at the Command Line

The Data Science at the Command Line book, now in its 2nd edition, gives a pretty good big-picture view.

While I’m familiar with many of the tools the author mentions, I had not heard of Vagrant and was only vaguely aware of csvkit’s extensive capabilities. I like it when tools stay true to the UNIX ethos of “do one thing, do it well”, but it’s nice to have a comprehensive toolkit for a well-defined task. Not re-inventing the wheel is a noble goal to strive for.

Increasing in complexity, the book also covers Drake, “GNU make for data”, and then moves on to visualization. I have to admit, I’m a little less convinced about using command line tools to visualize data. Sure, ImageMagick is very powerful and has saved me tons of time processing images in batch mode, but calling tee some_exploratory_plot.png | display every time seems a little tedious to me.

Finally, the author covers Weka, a Java-based tool for clustering and regression, and BigML for applying ML models.

The second edition added more tools: Tapkee for dimensionality reduction and Vowpal Wabbit for regression, among others. The author also contributed additional utilities in his own GitHub repo.

Other Tools

Recently, Miller has gained popularity: in addition to CSV/TSV it can also handle JSON, which invites natural comparisons to jq.

Written on November 2, 2019