Hello! I'm an analyst with about 20 years of experience in the life sciences, having worked in computational biology since 2005. My PhD is in Biological Sciences, and I have worked at the bench, but these days I'm mostly in front of computers. Here you will find:

- The Research page, which has my resume and portfolio of past projects.
- The Courtyard Kitchen, the companion blog to the ink-on-paper cookbook I published.
- The Prince, an illustrated children's fable that I published via Kickstarter.
- Photography, my stint as a photojournalist and self-published photo books.
- Writings, which include essays and a novel I'm writing at a glacial pace.

Below are my blog posts in reverse chronological order. I write about projects I'm working on or books I have read, and occasionally I also write about non-technical topics. The blog posts are always evolving and some may not be polished (or even complete), but I do periodically revisit old posts to fix broken links and update information.

Building AI apps

As a technology matures, the focus gradually shifts from pure research to development aspects such as deploying, scaling and benchmarking. In parallel with the intense general use of AI by non-programmers, tools for AI application development have also progressed significantly, especially in the last few years. And while I think there is still much progress to be made by AI researchers, it’s fun to learn some of the things AI developers deal with, especially if you are thinking about building an AI-powered app yourself in 2025.

more...

My favorite ggplot2 extensions

One of my favorite pieces of software is ggplot2: it implements a useful tool based on elegant design principles, has reasonable defaults and great extensibility. So great is this extensibility that sometimes it’s a little hard to keep track of the many, many software packages ggplot2 has inspired. Below are some of the ggplot2-related packages that I use regularly, ranging from packages that I rediscovered in very old notes to more modern packages that I have used recently. This post may be updated as I discover more packages.

more...

Popular books on forecasting

Prediction and forecasting have always been interesting to me, both as a personal intellectual curiosity and as a professional concern for a data analyst. Below are the notes I took as I read two books on the subject written for a general audience: Nate Silver’s The Signal and The Noise and Superforecasting by Philip Tetlock and Dan Gardner.

more...

Notetaking with Markdown, LaTeX and beyond

As a professional scientist and lifelong geek, I take notes a lot. After several decades of note-taking, I have accumulated a significant amount of information that I sometimes need to retrieve (“I read a nice article that explained Rayleigh vs. Raman scattering really well. Now where is it?!”). As a person living in the Google era, I have also gotten less patient with looking for information. Fortunately, I’ve figured out how to keep notes. The hard way. Mostly.

more...

Multimodal Data Integration

As the cost of high-throughput biological assays falls, it has become more common to perform multimodal data integration, where different assays are performed on the same subjects and/or experimental conditions and the data are analyzed together. The assumption is that such integrations would give us additional confidence in the biological insights, or perhaps even reveal new ones.

more...

Working with synthetic biomedical data

When developing data analysis methods, one major issue is having access to realistic examples to make analyses robust and reproducible. For simple sanity checks, random number generators (with appropriate constraints on range, frequency, etc.) or Fisher’s Iris dataset could be enough. But if you need the input to be in a highly specific format with realistic biological variation like biomedical data or electronic health records, simple random number generators are no longer sufficient.

more...

Interactive data visualization in Python

A few years ago, I wrote about making GUI applications in Python. However, in my experience, most of the time what we really need are just dashboards: simple Python apps that visualize data, with interactive filtering and faceting. One path of low (if not least) resistance is of course IPython notebooks, but there are increasingly feature-rich options, depending on your particular needs and constraints.

more...

Resampling methods and Introduction To The Bootstrap book

As humans we all like to have good luck. Professionally however, a “lucky” set of scientific results creates more problems than it solves, giving us a false sense of success culminating in a biased view of reality. Whether you work at the wet bench or entirely in silico, a common question that comes up in research is: how likely were my results? I find myself pondering this question often, and decided to read up on bootstrapping and write down the things I have learned.
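As a rough illustration of the idea, here is a minimal sketch of a percentile bootstrap confidence interval for a mean, using only the Python standard library. The function name `bootstrap_ci`, its parameters, and the measurements are all made up for this example; this is one simple variant of the bootstrap, not necessarily the one covered in the book.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Resample with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    # Drop the lowest and highest alpha/2 fraction of the estimates.
    k = int(n_resamples * alpha / 2)
    return estimates[k], estimates[n_resamples - 1 - k]

# Made-up measurements, for illustration only.
measurements = [4.1, 5.3, 4.8, 6.0, 5.1, 4.7, 5.9, 5.2, 4.4, 5.6]
low, high = bootstrap_ci(measurements)
print(f"mean = {statistics.mean(measurements):.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

The width of the interval gives a sense of how much the mean would move around if the experiment were repeated, which is one way of asking how "lucky" a single result might have been.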

more...

Business strategy books

This post is a natural complement to my post about management. Where people management is concerned with micro effects, strategy is concerned with macro effects. The two are often intertwined, usually against a background of evolving market trends, which makes for interesting reading. Below are my thoughts on some business strategy books that I have read.

more...

Management books, workshops and advice

Every grad student has to supervise younger students at some point in their career, and my experiences managing 3 teams of undergrads were some of the highlights of my PhD. It was a relatively low-risk, informal introduction to management. I learned a lot and I wanted more. After some more informal mentoring experiences at Fred Hutch, I was a formal manager at my next job. Below are notes I have collected on management.

more...

Being productive in Scala

A long time ago, when the Earth’s crust had just cooled, I learned Java. Some years later, my best friend told me about Clojure, a functional language that compiles to JVM bytecode. The idea was intriguing to me, but at that point, I felt Clojure was maybe a little niche (my friend had pitched Clojure as a language to do massively parallel simulations). Since then, other languages targeting the JVM have popped up. I re-entered the fray with Scala while playing with Spark, and am aware of at least one group that uses Scala on a regular basis.

more...

The design of experiments

Having started out in field ecology, done some work at the bench, then transitioned to computational biology, I have had my fair share of exposure to experimental design. Whether it is the relatively simple task of having controls for PCR reactions, picking sites for a field survey, or designing a complex multi-site vaccine trial, as our capacity for larger experiments grows, their designs become increasingly important. In a perfect world, every experiment would be carefully designed before being carried out, but reality is often sadly messier.

more...

Linear models

Linear models have been around for a long time, and despite the attention given to more modern methods, they remain relevant. The principle behind them is easy to understand, though once you look at them rigorously there is a lot to consider. This simplicity means linear models have been extended and built upon for new data types and applications.

more...

Building desktop Graphical User Interfaces for Python code

When working in R, my work mostly involved reproducible analyses, and consisted mostly of scripts. For the occasional interactive report, Shiny works well for wrapping graphical interfaces around R scripts that can be made into stand-alone applications. For building Python GUI apps, there are more options, including tkinter, wxPython and PyQt.

more...

Inferring phylogenies from genomic data

Besides fundamental scientific interest, the phylogeny of infectious pathogens is also important for understanding and thus ultimately controlling the spread of epidemics. This is increasingly important in modern times in light of recurrent outbreaks (e.g. MERS, SARS, COVID-19). The accessibility of sequencing technologies means that phylogenies can be rooted in genomic data, often at near real-time speeds.

more...

Meta-analyses

Meta-analyses have been around for a long time. By some accounts, the first meta-analyses trace back to 17th century astronomy research. They are also prevalent in medical research, with papers starting in the 1900s. Since 1993, the Cochrane Reviews have published comparative health findings. More recently, the growth of open data initiatives has made it easier to perform data integration and meta-analyses in your favorite languages.

more...

Tree-based regression and classification

In prediction and forecasting, the understandability of the model is often as important as accuracy or recall. Decision trees split up the solution space with each branching point. Despite their age, decision trees often have an advantage in interpretability, since branch points are associated with particular thresholds (or classes) in the input data. As a result, decision trees are often used in clinical and operations management contexts.
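To make the interpretability point concrete, here is a minimal sketch of the smallest possible tree, a one-split "decision stump" on a single feature. The helper `fit_stump` and the toy data are hypothetical, and the sketch only considers rules of the form "predict 1 above the threshold"; real tree libraries use impurity measures and many features.

```python
def fit_stump(xs, ys):
    """Find the threshold on one feature that minimizes misclassifications."""
    best = None
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for a, b in zip(values, values[1:]):
        thr = (a + b) / 2
        # Rule being scored: predict 1 when x > thr, otherwise predict 0.
        errors = sum((x > thr) != bool(y) for x, y in zip(xs, ys))
        if best is None or errors < best[1]:
            best = (thr, errors)
    return best

# Toy data, made up for illustration.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
threshold, errors = fit_stump(xs, ys)
print(f"if x > {threshold}: predict 1, else predict 0 ({errors} errors)")
```

The fitted model is a single readable rule, which is the property that makes trees attractive when a prediction has to be explained to someone.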

more...

Workflow managers for bioinformatic pipelines

Due to the growth of data, workflow managers (e.g. Airflow, Prefect) have been growing in popularity. In bioinformatics this popularity is further accelerated by the replication crisis in science. Workflow managers automate routine tasks while also ensuring reproducibility by enabling drop-in changes in data, runtime parameters, or even entire toolchains. At Fred Hutch where I used to work, Nextflow and Cromwell were most popular. Elsewhere, Snakemake is also popular, though I don’t have much personal experience with it.

more...

Neural networks

One lesson I learned during graduate school was the importance of understanding things from first principles. So for learning about neural networks, I really enjoyed reading Grokking Deep Learning, which implements neural networks using only NumPy, keeping the reader from being bogged down with implementation details. After that, Michael Nielsen’s Neural Networks and Deep Learning is a good follow-up. If you prefer watching videos rather than reading, Josh Starmer’s StatQuest Neural Networks playlist is also excellent. Another excellent introduction is this video from Grant Sanderson, which also goes into mathematical details.

more...

Handling imbalanced data

When performing classification, one problem occurs when you have much more data in one class than in another, roughly defined as a 4:1 or greater ratio in the binary case, though imbalance can also occur in multi-class datasets. In these scenarios, the accuracy paradox applies: you can get very high accuracy simply because your predictions are biased towards the more abundant class.
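A small sketch of the accuracy paradox, using made-up labels with a 19:1 imbalance and a "classifier" that always predicts the majority class. The helper functions `accuracy` and `recall` are written out here for illustration rather than taken from any library.

```python
# Made-up labels: 95 negatives and 5 positives (a 19:1 imbalance).
y_true = [0] * 95 + [1] * 5
# A degenerate "classifier" that always predicts the majority class.
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted as positive."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

The model looks 95% accurate while never finding a single positive case, which is why metrics such as recall are usually reported alongside accuracy on imbalanced data.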

more...

Visualizing and analyzing geographic data

This post is written as a follow-up to my post on JavaScript mapping. In the early 2000s I worked in ecology as a GIS modeler and (very briefly) in the field. Back then ArcView dominated GIS, especially in government agencies. In 2019 ArcView is called ArcGIS, and still looks to be dominant, though alternatives like the open-source QGIS and the commercial eSpatial have matured. Also, you can now do many GIS tasks like geocoding and spatial querying without having a full-blown GIS. You just need R and its friendly neighborhood packages.

more...

Modeling Gaussian mixtures

Thanks to R’s roots as a statistical programming language, it has very strong support for common statistical tasks like modeling and prediction. For modeling your data as a mixture of distributions, I know of the mixtools package, though there are many others, as reviewed by this paper, and presumably more are being developed all the time. I was told by my colleague Chad about mclust, which seems to strike a good balance between features, ease-of-use, and speed (thanks Chad!).

more...

Time-series analysis of iPhone health data

In a previous blog post, I exported biking distance, running distance and daily step counts from my iPhone. I then tried to detect a possible change in step count, using pymc3 to model two different distributions. While this achieves the goal of detecting a single changepoint, there is a lot more we can do with that iPhone data. For example, my biking is presumably a stationary process and could possibly have some periodicity; we can try to model these using time-series tools. Relevant R code is found in my timeseries_analysis.R.

more...

Designing Data-Intensive Applications book

I’ve run across Martin Kleppmann’s Designing Data-Intensive Applications book before, and it sounded interesting. A few weeks ago a colleague mentioned it again in passing, so I thought it was time to finally read it. Below are my notes, added as I work my way through the book:

more...

Thoughts on clean code and Clean Code book

I’m making my way through Robert C. Martin’s classic Clean Code book. I won’t do an exhaustive summary since there are already many great summaries out there (here’s one), just my general thoughts.

more...

Command line tools for data cleaning and analysis

Having been a UNIX user for a while, I know of the wonderful things built out of command line utilities, like one-liners in awk, sed, or even bash itself. I also know of Erick Matson’s excellent guide to “second-generation” shell tools (tmux, fd, ag, etc.). The beauty of command line tools is that you can chain them together, from the humble cut and grep to the new verticalize, and you can wrap your code in expect to handle interactivity.

more...

Recommender Systems

I enjoy movies, particularly film critique and cinematography (you should check out the excellent YouTube series Every Frame A Painting for some thoughtful discussions on these topics). When looking for sample data to play with recommender systems, I compiled a list of movies I have seen, and my personal 1-10 rating for them here. Some common approaches to recommender systems are:

more...

Analyzing graphs and social networks

A while back, my labmate Ju showed me a visualization he was working on to explore co-authorships at Fred Hutch, which inspired me to use REntrez to visualize my much more modest publication network. I wrote a small snippet of code that showed the different research circles I participated in over the years. Before this, I also did some analyses on gene networks (see below).

more...

Bayesian modeling in your favorite language(s)

When I was in college, the division between Bayesian and frequentist statistics seemed significant (pun maybe intended). Perhaps it seemed that way because of this 2012 book or this 2014 NYTimes article? Interestingly, for just as long, there have been rebuttals arguing that a “Bayesian vs. frequentist” mindset is not productive or practical. I tend to agree with these rebuttals: you should understand the data and use the right tool for the job. And speaking of tools, here are some ways you can get Bayesian on your favorite dataset:

more...

Interactive street maps in JavaScript

During the summer of 2019, my then-girlfriend (now wife) and I started looking at buying a house. Due to Seattle’s infamously hot housing market, we ended up looking at >30 houses before making an offer. The geek in me saw this as an opportunity to do some mapping and visualization, and also learn some new technologies.

more...

Project Euler in R and Python

I thought it would be fun to test my R skills by doing Project Euler. Modern programming languages have rather full-featured libraries, but it’s still nice to learn (or re-learn) things from first principles, or see what others come up with. You can see the results in this GitHub repository: euler-r.

more...