Command line tools for data cleaning and analysis

Having been a UNIX user for a while, I knew of the wonderful things built out of command line utilities, like one-liners in awk, sed, or even bash itself. I also knew of Erick Matson’s excellent guide to “second-generation” shell tools (tmux, fd, ag, etc.). The beauty of command line tools is that you can chain them together, from the humble cut and grep to the newer verticalize, and you can wrap your code in expect to handle interactivity.
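To make the chaining idea concrete, here is a minimal sketch using only classic tools (cut, grep, sort, uniq). The sample rows and the function name count_chr1_status are made up for illustration; the point is just how each tool does one small job and hands its output to the next.

```shell
# Count how often each status occurs among rows located on chr1.
# Input: comma-separated lines of the form id,chromosome,status (hypothetical data).
count_chr1_status() {
    cut -d, -f2,3 |       # keep only the chromosome and status columns
        grep '^chr1,' |   # keep rows on chr1 (the comma stops chr10 from matching)
        cut -d, -f2 |     # keep only the status
        sort |            # uniq only collapses adjacent duplicates
        uniq -c |         # prefix each distinct status with its count
        sort -rn          # most frequent first
}

printf '%s\n' \
    'id1,chr1,pass' \
    'id2,chr2,fail' \
    'id3,chr1,pass' \
    'id4,chr1,fail' \
    'id5,chr10,pass' |
    count_chr1_status
```

The exact padding in front of the counts depends on your uniq implementation, so it is safer to parse the result by field than by column position.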

Processing text: CSV and JSON

Since manipulating structured text is such a common task, there are quite a few tools available. Recently, Miller has gained popularity: it can not only parse CSV files, but also perform SQL-like queries and joins, calculate summary statistics, and run basic regressions. Miller also handles JSON, though if you work with JSON a lot, it’s probably better to use jq and similar JSON utilities.
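For a sense of what a per-group summary statistic looks like, here is a sketch that deliberately uses plain awk rather than Miller, so it runs without installing anything. The column names (sample, reads) and the function name mean_by_group are hypothetical, and it assumes a simple CSV with no quoted fields. With Miller, something like mlr --icsv --ocsv stats1 -a mean -f reads -g sample should give a comparable result in one command.

```shell
# Mean of the second column (reads) for each value of the first column (sample).
# Assumes a header row and simple CSV: no quoting, no embedded commas.
mean_by_group() {
    awk -F, '
        NR == 1 { next }                # skip the header row
        { sum[$1] += $2; n[$1]++ }      # accumulate total and row count per group
        END {
            for (g in sum) printf "%s,%.2f\n", g, sum[g] / n[g]
        }
    ' | sort                            # awk array order is unspecified, so sort it
}

printf '%s\n' \
    'sample,reads' \
    'A,10' \
    'B,4' \
    'A,20' \
    'B,6' |
    mean_by_group
```

The awk version is fine for a quick look, but it is exactly the kind of hand-rolled parsing that breaks on quoted fields, which is a good reason to reach for a CSV-aware tool on real data.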

Bioinformatics and dealing with big data

Working in bioinformatics, I have also come across a whole cottage industry of bioinformatics one-liners using bcftools and samtools (and their refined variants, e.g. sambamba). For FASTQ files there is the excellent fastp. To glue these tools together, there are workflow managers, as I described in a previous post.

There are also bioinformatic variants of classic UNIX tools, like seqtk and bioawk. When dealing with big data, it’s good to parallelize your tasks, using both a general-purpose tool like GNU Parallel and parallelized versions of existing tools, like pigz for gzip. Often, these tools are picked up the hard way, as documented in this wonderfully informative blogpost about analyzing 25TB of sequencing data.
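As a rough illustration of running one job per file in parallel, here is a sketch that uses xargs -P as a stand-in for GNU Parallel, since xargs is more likely to be installed already. The function name compress_in_parallel, the limit of four jobs, and the throwaway files are all assumptions for the demo. Note that -print0, -0, and -P are common GNU/BSD extensions rather than strict POSIX. With GNU Parallel, the same idea would look something like parallel gzip ::: *.txt.

```shell
# Compress every .txt file under a directory, running up to four gzip
# processes at a time. xargs -P is used here in place of GNU Parallel.
compress_in_parallel() {
    dir=$1
    find "$dir" -type f -name '*.txt' -print0 |   # NUL-separated, safe for odd filenames
        xargs -0 -n 1 -P 4 gzip -f                # one file per gzip, four at once
}

# Demo on throwaway files in a temporary directory.
demo_dir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf 'sample %s\n' "$i" > "$demo_dir/file$i.txt"
done
compress_in_parallel "$demo_dir"
ls "$demo_dir"
```

This parallelizes across files; a tool like pigz instead parallelizes the compression of a single large file, so the two approaches complement each other.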

Data Science at the Command Line

The Data Science at the Command Line book gives a pretty good overview of the subject (the second edition was published in 2021).

While I was familiar with many of the tools the author covers, at the time I had not heard of Vagrant and was only vaguely aware of csvkit’s extensive capabilities. I like it when tools stay true to the UNIX ethos of “do one thing, do it well”, but it’s nice to have a comprehensive toolkit for a narrowly-defined task. Not re-inventing the wheel is a noble goal to strive for.

Increasing in complexity, the book also covers Drake, “GNU make for data”, then moves on to visualization. I have to admit, I’m a little less convinced about using command line tools to visualize data. Sure, ImageMagick is very powerful and has saved me tons of time processing images in batch mode, but calling tee some_exploratory_plot.png | display every time seems a little tedious to me. UPDATE: see below

Finally, the author covers Weka, a Java-based tool for clustering/regression, and BigML for applying ML models.

The second edition added more tools: Tapkee for dimensionality reduction and Vowpal Wabbit for regression, among others. The author also contributed additional utilities in his own GitHub repo.

Visualizing data in the command line

Visualizing data in the command line used to mean exporting your figures into JPG/PNG files, then opening them with another GUI program. However, you can now feed your CSV (or even STDIN pipes from, say, your favorite lightweight database) into YouPlot and make plots appear directly in the terminal.

Written on November 2, 2019