Hello! I'm an analyst with a background in the life sciences. I have worked at the bench, but these days I'm mostly in front of computers. Here you will find:

- The Research page, which has my resume and portfolio of past projects.
- The Courtyard Kitchen, a companion blog to the ink-on-paper cookbook I published.
- The Prince, an illustrated children's fable that I published via Kickstarter.
- Photography, my stint as a photojournalist and self-published photo books.
- Writings, which include essays and a novel I'm writing at a glacial pace.

Below are my blog posts in reverse chronological order. They're about projects I'm working on or topics I have read about, with the occasional random pondering thrown in. As a result, these posts are always evolving and some may not be polished (or even complete).

Resampling methods and Introduction To The Bootstrap book

Personally, we all like to have good luck. Professionally, however, a “lucky” set of scientific results creates more problems than it solves, giving us a false sense of success and culminating in a biased view of reality. Whether you work at the wet bench or entirely in silico, a common question that comes up in research is: how likely were my results? I find myself pondering this question often, and decided to read up on bootstrapping and write down the things I have learned.
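As a rough illustration of the idea, here is a minimal sketch of a percentile bootstrap for the mean, using only Python's standard library. The measurements, the function name `bootstrap_ci`, the number of resamples and the seed are all made up for illustration, and the percentile interval is just one of several bootstrap intervals.

```python
import random
import statistics

def bootstrap_ci(sample, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap confidence interval for the mean (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Resample with replacement n_boot times and record each resample's mean
    means = sorted(
        statistics.mean(rng.choices(sample, k=n)) for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical measurements, not real data
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.0, 4.9]
observed_mean = statistics.mean(sample)
lower, upper = bootstrap_ci(sample)
print(f"mean = {observed_mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The width of the interval gives a feel for how much the mean could move if the experiment were repeated with a similar sample.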

more...

Business strategy books

This post is a natural complement to my post about management. Where people management is concerned with micro effects, strategy is concerned with macro effects. The two are often intertwined, usually against a background of evolving market trends, which makes for interesting reading. Below are my thoughts on some business strategy books that I have read.

more...

Management books, workshops and advice

Every grad student has to supervise younger students at some point in their career, and managing 3 teams of undergrads was one of the highlights of my PhD experience. It was a relatively low-risk, informal introduction to management. I learned a lot and I wanted more. After some more informal mentoring experiences at Fred Hutch, I was a formal manager at my next job. Below are notes I have collected on management.

more...

Being productive in Scala

A long time ago, when the Earth’s crust had just cooled, I learned Java. Some years later, my best friend told me about Clojure, a functional language that compiles to JVM bytecode. The idea was intriguing to me, but at that point, I felt Clojure was maybe a little niche (my friend had pitched Clojure as a language to do massively parallel simulations). Since then, other languages targeting the JVM have popped up. I re-entered the fray with Scala while playing with Spark, and am aware of at least one group that uses Scala on a regular basis.

more...

The design of experiments

Having started out in field ecology, done some work at the bench, then transitioned to computational biology, I have had my fair share of exposure to experimental design. From the relatively simple task of having controls for a PCR reaction, to picking sites for a field survey, to designing a complex multi-site vaccine trial, as our capacity for larger experiments grows, their designs become increasingly important. In a perfect world, every experiment would be carefully designed before being carried out, but reality is often sadly messier.

more...

Linear models

Linear models have been around for a long time, and despite the attention given to more modern methods, they remain relevant. The principle behind them is easy to understand, though once you look at them rigorously there is a lot to consider. This simplicity means linear models have been extended and built upon for new data types and applications.

more...

Building Graphical User Interfaces for Python code

When working in R, my work mostly involved reproducible analyses, and consisted mostly of scripts. For the occasional interactive report, Shiny works well for wrapping graphical interfaces around R scripts that can be made into stand-alone applications. For building GUIs in Python, there are more options, including tkinter, wxPython and PyQt.

more...

Inferring phylogenies from genomic data

Besides fundamental scientific interest, the phylogeny of infectious pathogens is also important for understanding and thus ultimately controlling the spread of epidemics. This is increasingly important in modern times in light of recurrent outbreaks (e.g. MERS, SARS, COVID-19). The accessibility of sequencing technologies means that phylogenies can be rooted in genomic data, often at near real-time speeds.

more...

Meta-analyses

Meta-analyses have been around for a long time. By some accounts, the first meta-analyses trace back to 17th-century astronomy research. They are also prevalent in medical research, with papers starting in the 1900s. Since 1993, the Cochrane Reviews have published comparative health findings. More recently, the growth of open data initiatives has made it easier to perform data integration and meta-analyses in your favorite languages.

more...

Tree-based regression and classification

In prediction and forecasting, the understandability of the model is often as important as accuracy or recall. Decision trees split up the solution space with each branching point. Despite their age, decision trees often have an advantage in interpretability, since branch points are associated with particular thresholds (or classes) in the input data. As a result, decision trees are often used in clinical and operations management contexts.
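To make the idea of a branch point concrete, here is a minimal sketch that finds the single best threshold on one numeric feature by minimizing weighted Gini impurity. The ages and outcomes are hypothetical, the function names are my own, and real tree libraries do considerably more (multiple features, recursion, pruning).

```python
def gini(labels):
    """Gini impurity for a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_threshold(values, labels):
    """Return (threshold, weighted impurity) of the best single split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_thr = float("inf"), None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate identical values
        thr = (lo + hi) / 2  # candidate: midpoint between neighbours
        left = [lab for val, lab in pairs if val <= thr]
        right = [lab for val, lab in pairs if val > thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_score, best_thr = score, thr
    return best_thr, best_score

# Hypothetical patient ages and a binary outcome, for illustration only
ages = [25, 32, 41, 47, 58, 63, 70, 76]
outcome = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, impurity = best_threshold(ages, outcome)
print(f"split at age <= {threshold} (weighted Gini = {impurity})")
```

The resulting rule reads as a plain statement about the input ("age at or below the threshold goes left"), which is where the interpretability comes from.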

more...

Workflow managers for bioinformatic pipelines

Due to the growth of data, workflow managers (e.g. Airflow, Prefect) have been growing in popularity. In bioinformatics this popularity is further accelerated by the replication crisis in science. Workflow managers automate routine tasks while also ensuring reproducibility by enabling drop-in changes in data, runtime parameters, or even entire toolchains. At Fred Hutch, where I used to work, Nextflow and Cromwell were most popular. Elsewhere, Snakemake is also popular, though I don’t have much personal experience with it.

more...

Neural network noodling

One lesson I took away from my time in graduate school was the importance of understanding things from first principles. So for learning about neural networks, I really enjoyed reading Grokking Deep Learning, which implements neural networks using only numpy, keeping the reader from being bogged down with implementation details. Another excellent introduction is this video from Grant Sanderson*, which also goes into mathematical details. After that, Michael Nielsen’s Neural Networks and Deep Learning is a good followup.

more...

Handling imbalanced data

When performing classification, a common problem arises when one class has much more data than another, roughly defined as a 4:1 or greater ratio in the binary case, though imbalance can also occur in multi-class datasets. In these scenarios, the accuracy paradox applies: you can get very high accuracy even though your predictions are heavily biased towards the more abundant class.
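A small worked example shows the paradox. This is a sketch with made-up labels at a 9:1 ratio and a deliberately naive "classifier" that always predicts the majority class; the helper functions are my own and not taken from any library.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive):
    """Fraction of true `positive` cases that were predicted as `positive`."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)

# Hypothetical dataset: 90 negatives (0) and 10 positives (1)
y_true = [0] * 90 + [1] * 10

# Naive baseline: always predict the most common class
majority_class = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_class] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)
minority_recall = recall(y_true, y_pred, positive=1)
print(f"accuracy = {majority_accuracy}, minority recall = {minority_recall}")
```

The baseline scores 90% accuracy while never identifying a single minority case, which is why metrics such as recall are usually reported alongside accuracy for imbalanced data.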

more...

Visualizing and analyzing geographic data

This post is written as a follow-up to my post on JavaScript mapping. In the early 2000s I worked in ecology as a GIS modeler and (very briefly) in the field. Back then ArcView dominated GIS, especially in government agencies. In 2019 ArcView is called ArcGIS and still looks to be dominant, though alternatives like the open-source QGIS and the commercial eSpatial have matured. You can also now do many GIS tasks, like geocoding and spatial querying, without having a full-blown GIS. You just need R and its friendly neighborhood packages.

more...

Modeling Gaussian mixtures

Thanks to R’s roots as a statistical programming language, it has very strong support for common statistical tasks like modeling and prediction. For modeling your data as a mixture of distributions, I know of the mixtools package, though there are many others, as reviewed by this paper, and presumably more are being developed all the time. My colleague Chad told me about mclust, which seems to strike a good balance between features, ease-of-use, and speed (thanks Chad!).

more...

Time-series analysis of iPhone health data

In a previous blog post, I exported biking distance, running distance and daily step counts from my iPhone. I then tried to detect a possible change in step count, using pymc3 to model two different distributions. While this achieves the goal of detecting a single changepoint, there is a lot more we can do with that iPhone data. For example, my biking is presumably a stationary process and could have some periodicity, so we can try to model these using time-series tools. Relevant R code is found in my timeseries_analysis.R.

more...

Command line tools for data cleaning and analysis

Having been a UNIX user for a while, I know of the wonderful things built out of command line utilities, like one-liners in awk, sed, or even bash itself. I also know of Erick Matson’s excellent guide to “second-generation” shell tools (tmux, fd, ag, etc.). The beauty of command line tools is that you can chain them together, from the humble cut and grep to the newer verticalize, and you can wrap your code in expect to handle interactivity.

more...

Recommender Systems

I enjoy movies, particularly film critique and cinematography (you should check out the excellent YouTube series Every Frame A Painting for some thoughtful discussions on these topics). When looking for sample data to play with recommender systems, I compiled a list of movies I have seen, and my personal 1-10 rating for them here. Some common approaches to recommender systems are:

more...

Analyzing graphs and networks

A while back, my labmate Ju showed me a visualization he was working on to explore co-authorships at Fred Hutch, which inspired me to use REntrez to visualize my much more modest publication network. I wrote a small snippet of code that showed the different research circles I participated in over the years. Before this, I also did some analyses on gene networks (see below).

more...

Bayesian modeling in your favorite language(s)

When I was in college, the division between Bayesian and frequentist statistics seemed significant (pun maybe intended). Perhaps it seemed that way because of this 2012 book or this 2014 NYTimes article? Interestingly, for just as long, there have been rebuttals arguing that a “Bayesian vs. frequentist” mindset is not productive or practical. I tend to agree with these rebuttals: you should understand the data and use the right tool for the job. And speaking of tools, here are some ways you can get Bayesian on your favorite dataset:

more...

Interactive street maps in JavaScript

During the summer of 2019, my then-girlfriend (now wife) and I started looking at buying a house. Due to Seattle’s infamously hot housing market, we ended up looking at >30 houses before making an offer. The geek in me saw this as an opportunity to do some mapping and visualization, and also learn some new technologies.

more...

Project Euler in R and Python

I thought it would be fun to test my R skills by doing Project Euler. Modern programming languages have rather full-featured libraries, but it’s still nice to learn (or re-learn) things from first principles, or see what others come up with. You can see the results in this GitHub repository: euler-r.

more...