Visualizing and analyzing geographic data

This post is a follow-up to my post on JavaScript mapping. In the early 2000s I worked in ecology as a GIS modeler and (very briefly) in the field. Back then ArcView dominated GIS, especially in government agencies. In 2019 ArcView is called ArcGIS and still looks to be dominant, though alternatives like eSpatial have matured and even the open-source QGIS has progressed significantly. You can also now do many GIS tasks, like geocoding and spatial querying, without a full-blown GIS. You just need R and its friendly neighborhood packages.

Geographic scale considerations

Before you install every R geographic package on CRAN or GitHub, it’s important to know the physical scale of the task you’re performing. Mapping at street level is different from mapping countries or entire regions. For one, you don’t need to worry about projection in the former, but it is critically important in the latter. Scale also determines how much tooling you need: a static map of the continental US only needs a single function call. Similarly, if you want to highlight a few cities on that map, simply adding the corresponding markers directly via their latitude and longitude will work, without the overkill of a full-blown geocoding workflow.
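To make the projection point concrete, here is a minimal sketch in plain Python (standard library only). It compares a great-circle distance on a spherical Earth with the distance you get by treating latitude and longitude as a flat grid scaled at the mean latitude. The function names, the spherical-Earth radius, and the approximate landmark coordinates are my own assumptions for illustration, not part of any package mentioned here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; assumes a spherical Earth


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def flat_km(lat1, lon1, lat2, lon2):
    """Distance if lat/lon are treated as a flat grid (equirectangular)."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    x = math.radians(lon2 - lon1) * math.cos(mean_lat)
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)


def relative_error(p, q):
    """Relative disagreement between the flat and great-circle distances."""
    great = haversine_km(*p, *q)
    return abs(flat_km(*p, *q) - great) / great


# Approximate coordinates, for illustration only.
times_square = (40.7580, -73.9855)
empire_state = (40.7484, -73.9857)
new_york = (40.7128, -74.0060)
los_angeles = (34.0522, -118.2437)

print("street-level error: ", relative_error(times_square, empire_state))
print("country-level error:", relative_error(new_york, los_angeles))
```

Over a few city blocks the two distances agree almost exactly, while across the continent the flat shortcut drifts by a noticeable fraction. That drift is why projection choice matters for country-scale maps.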

The R ecosystem for working with geographic data

For obtaining data, the rnaturalearth package draws from the excellent database of the same name; similarly, osmdata pulls from OpenStreetMap (OSM).

For data structures, one well-known package is sf, which lets you store Simple Features and also provides some data processing and analysis functions. For handling data larger than a single computer’s memory, Apache Arrow and Parquet are very useful. These are supported in R through geoarrow and sfarrow, in line with the GeoParquet standard.

Geographic data are for the most part tabular, so you can do much of the data wrangling using tidyverse tools if you wish. For analysis, you can use the spdep package to calculate spatial autocorrelation (e.g., Moran’s I), and spatstat and stpp to understand point patterns and perform other statistical work. To detect clusters of objects in geographic data, which is particularly useful in spatial epidemiology, SpatialEpi implements the classic Besag-Newell algorithm, and scanstatistics implements several scan statistics, such as Kulldorff’s.
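If Moran’s I is new to you, a tiny from-scratch version can help build intuition before reaching for a package. The sketch below is plain Python and computes global Moran’s I on a small grid, assuming binary rook (shared-edge) weights. The helper names and toy grids are hypothetical, and packages such as spdep offer other weighting schemes, so their numbers may differ from this simplified version.

```python
def rook_neighbors(n_rows, n_cols):
    """Map each grid cell index to the cells that share an edge with it."""
    neighbors = {}
    for r in range(n_rows):
        for c in range(n_cols):
            adjacent = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < n_rows and 0 <= cc < n_cols:
                    adjacent.append(rr * n_cols + cc)
            neighbors[r * n_cols + c] = adjacent
    return neighbors


def morans_i(values, neighbors):
    """Global Moran's I with binary weights (1 for neighbors, 0 otherwise)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # Sum of cross-products over every directed neighbor pair.
    cross = sum(dev[i] * dev[j] for i, adj in neighbors.items() for j in adj)
    total_weight = sum(len(adj) for adj in neighbors.values())
    return (n / total_weight) * (cross / sum(d * d for d in dev))


grid = rook_neighbors(4, 4)

# Left half high, right half low: similar values sit next to each other.
clustered = [1, 1, 0, 0] * 4
# Checkerboard: every neighbor has the opposite value.
checkerboard = [(r + c) % 2 for r in range(4) for c in range(4)]

print("clustered:   ", morans_i(clustered, grid))
print("checkerboard:", morans_i(checkerboard, grid))
```

The clustered grid yields a positive value and the checkerboard a strongly negative one, which is the basic reading of the statistic: positive when neighbors resemble each other, negative when they differ.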

For handling projections, the low-level GDAL library has bindings for both R (in the form of rgdal, which has since been retired from CRAN in favor of sf and terra) and Python.

Geocoding can be done using ggmap, tmaptools (tmap’s companion package) or tidygeocoder. These packages rely on data from Google Maps and/or OSM. In general I prefer OSM since it doesn’t require an API key, though the data from Google Maps tend to be of higher quality. If you want to geocode IP addresses, there is an appropriately named r_IPgeocode package for that, of course.

For plotting maps, ggplot2 conveniently implements geom_sf for static maps and coord_sf for projection, as described in this great writeup on r-spatial.org. If instead you want to create hexbin maps, e.g., of election results, the rgeos package used to be the way to calculate the centroid of each bin, but it has since been retired from CRAN; sf::st_centroid() or terra::centroids() can do the same job. The bins can then be fortified for use with geom_polygon, as demonstrated on r-graph-gallery.
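For a sense of what a centroid function does with each hexagonal bin, here is a minimal plain-Python sketch of the area-weighted centroid of a simple polygon (the shoelace approach). It assumes planar, already-projected coordinates and a non-self-intersecting polygon. The function names are hypothetical, and this is not a claim about how sf or terra implement it internally.

```python
import math


def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) tuples."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]  # wrap around to close the ring
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("polygon has zero area")
    # Cx = sum / (6 * area), and 6 * area == 3 * twice_area.
    return cx / (3 * twice_area), cy / (3 * twice_area)


def hexagon(center_x, center_y, radius):
    """Vertices of a regular hexagon, counter-clockwise from angle 0."""
    return [
        (center_x + radius * math.cos(math.radians(60 * k)),
         center_y + radius * math.sin(math.radians(60 * k)))
        for k in range(6)
    ]


# One hexagonal bin centered at (2, 3) in projected map units.
print(polygon_centroid(hexagon(2.0, 3.0, 1.0)))
```

For a regular hexagon the centroid is just the center, but the same function also handles irregular shapes, where simply averaging the vertices would give the wrong answer.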

It’s good to be polyglot: geographic data in Python

If you’re working in Python, GeoViews, part of the HoloViz* collection of libraries, handles large-scale maps. For visualization, you can use the old-school matplotlib, the new-school seaborn, as well as a Python version of ggplot. If you need interactivity, the web-oriented Bokeh works well.

Further reading

Many of the packages in this post give a bit of context for their functions. The Spatial Data Science With Applications in R book is a good resource that covers both theory and general spatial applications. Applied Spatial Analysis for Public Health is an online course that begins with spatial concepts before focusing on public health.

*Somewhat confusingly, HoloViz also provides hvPlot, which allows general-purpose plotting that partially overlaps with seaborn/matplotlib/ggplot.

Written on December 3, 2019