This blog post pulls together various tips and suggestions that I’ve left around the place. My main goal is to show you some simple workflows that I use for high-performance geospatial work in R, leaning on the data.table, sf and geos packages.
If you’re the type of person who likes to load everything at once, here are the R libraries and theme settings that I’ll be using in this post. (Don’t worry if not: I’ll be loading them again in the relevant sections below to underscore why I’m calling a specific library.)
Everyone who does spatial work in R is familiar with the wonderful sf package. You know, this one:
The revolutionary idea of sf was (is) that it allowed you to treat spatial objects as regular data frames, so you can do things like this:
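A minimal sketch of that kind of grouped aggregation, using the North Carolina demo shapefile that ships with sf. (The east/west split on centroid longitude is my own illustrative assumption.)

```r
library(sf)
library(dplyr)

# North Carolina demo data bundled with sf
nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Grouped aggregation on an sf object, exactly as if it were a regular
# data frame. The "hemisphere" split rule here is illustrative.
nc |>
  mutate(lon = st_coordinates(st_centroid(geometry))[, 1]) |>
  group_by(hemisphere = ifelse(lon < mean(lon), "west", "east")) |>
  summarise(area = sum(AREA))  # summarise() also unions the geometries
```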
In the above code chunk, I’m using dplyr to do a grouped aggregation on our North Carolina data object. The aggregation itself is pretty silly—i.e. divide the state into two hemispheres—but the same idea extends to virtually all of dplyr’s capabilities. It makes for a very potent and flexible combination that has driven an awful lot of R-based spatial work in recent years.
At the same time, there’s another powerful data wrangling library in R: data.table. This post is not going to rehash the (mostly pointless) debates about which of dplyr or data.table is better.^{1} But I think it’s fair to say that the latter offers incredible performance that makes it a must-use library for a lot of people, including myself. Yet it seems to me that many data.table users aren’t aware that you can use it for spatial operations in exactly the same way.
If you’re following along on your own computer, make sure to grab the development version (v1.14.3) before continuing:
Okay, let’s create a “data.table” version of our `nc` object and take a quick look at the first few rows and some columns.
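Something like the following, assuming the same `nc` object as before (the column names come from the bundled shapefile):

```r
library(sf)
library(data.table)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_dt = as.data.table(nc)

# First few rows of a handful of columns, including the sfc geometry list column
head(nc_dt[, .(NAME, AREA, BIR74, geometry)])
```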
At this point, I have to briefly back up to say that the reason I wanted you to grab the development version of data.table is that it “pretty prints” the columns by default. This not only includes the column types and keys (if you’ve set any), but also the special `sfc_MULTIPOLYGON` list column, which is where the sf magic is hiding. It’s a small cosmetic change that nonetheless underscores the integration between these two packages.^{2}
Just like we did with dplyr earlier, we can now do grouped spatial operations on this object using data.table’s concise syntax:
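A sketch of that grouped operation in data.table syntax (again, the east/west split on centroid longitude is an illustrative assumption on my part):

```r
library(sf)
library(data.table)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_dt = as.data.table(nc)

# Grouped spatial aggregation, data.table style
nc_dt[, lon := st_coordinates(st_centroid(geometry))[, 1]]
nc_dt_agg = nc_dt[
  ,
  .(area = sum(AREA), geometry = st_union(geometry)),
  by = .(hemisphere = fifelse(lon < mean(lon), "west", "east"))
]
nc_dt_agg
```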
Now, I’ll admit that there are a few tiny tweaks we need to make to the plot call. Unlike with the non-data.table workflow, this time we have to specify the geometry aesthetic with `aes(geometry = geometry, ...)`. Otherwise, ggplot2 won’t know what to do with this object. The other difference is that it doesn’t automatically recognise the CRS (i.e. “NAD27”), so the projection is a little off. Again, however, that information is contained within the geometry column of our `nc_dt` object. It just requires that we provide the CRS to our plot call explicitly.
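A sketch of what those two tweaks look like in practice (supplying NAD27 as EPSG:4267 via `coord_sf()` is my assumption about how to pass the CRS explicitly):

```r
library(sf)
library(data.table)
library(ggplot2)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_dt = as.data.table(nc)

# Map the geometry aesthetic explicitly, and supply the CRS ourselves
ggplot(nc_dt) +
  geom_sf(aes(geometry = geometry)) +
  coord_sf(crs = st_crs(4267))  # EPSG:4267, i.e. NAD27
```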
Plotting tweaks aside, I don’t want to lose sight of the main point of this post, namely: sf and data.table play perfectly well together. You can do (grouped) spatial operations and aggregations inside the latter, exactly how you would any other data wrangling task. So if you love data.table’s performance and syntax, then by all means continue using it for your spatial work too. Speaking of performance…
Update (2022-02-16): The benchmarks in this section are a bit unfair, since geos assumes planar (“flat”) geometries, whereas sf assumes spherical (“curved”) geometries by default. See the postscript at the bottom of this post, which corrects for this discrepancy.
As great as sf is, even its most ardent proponents will admit that it can drag a bit when it comes to big geospatial tasks. I don’t want to imply that it’s “slow”. But I’ve found that it does lag behind geopandas, for example, when I’m doing heavy geospatial computation or working with really large spatial files. Luckily, there’s a new package in town that offers major performance gains and plays very well with the workflow I demonstrated above.
Dewey Dunnington and Edzer Pebesma’s geos package covers all of the same basic geospatial operations as sf. But it does so by directly wrapping the underlying GEOS API, which is written in C and is thus extremely performant. Here’s a simple example, where we calculate the centroid of each North Carolina county.
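A sketch of that centroid calculation, with a rough timing comparison (my use of the bench package here is an assumption; microbenchmark would work just as well):

```r
library(sf)
library(geos)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Coerce to a GEOS geometry vector first, then compute the centroids
nc_geos = as_geos_geometry(nc)
cents   = geos_centroid(nc_geos)

# Rough benchmark: sf vs geos
bench::mark(
  sf   = st_centroid(nc$geometry),
  geos = geos_centroid(nc_geos),
  check = FALSE
)
```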
A couple of things worth noting. First, the geos centroid calculation completes orders of magnitude faster than the sf equivalent. Second, the executing functions are very similar (`st_centroid()` vs `geos_centroid()`). Third, we have to do an explicit `as_geos_geometry()` coercion before we can perform any geos operations on the resulting object.
That last point seems the most mundane. (Why aren’t you talking more about how crazy fast geos is?!) But it’s important since it underscores a key difference between the two packages and why the developers view them as complements. Unlike sf, which treats spatial objects as data frames, geos only preserves the geometry attributes. Take a look:
Gone are all those extra columns containing information about county names, FIPS codes, population numbers, etc. etc. We’re just left with the necessary information to do high-performance spatial operations.
Because we’ve dropped all of the sf / data frame attributes, we can’t use ggplot2 to plot anymore. But we can use the base R plotting method:
Actually, that’s not quite true, since an alternative is to convert it back into an sf object with `st_as_sf()` and then call ggplot2. This is particularly useful because you can hand off some heavy calculation to geos before bringing it back to sf for any additional functionality. Again, the developers of these packages designed them to act as complements.
Okay, back to the main post…
Finally, we get to the pièce de résistance of today’s post. The fact that `as_geos_geometry()` creates a GEOS geometry object—rather than preserving all of the data frame attributes—is a good thing for our data.table workflow. Why? Well, because we can just include this geometry object as a list column inside our data.table.^{3}
In turn, this means you can treat spatial operations as you would any other
operation inside a data.table. You can aggregate by group, merge, compare,
and generally combine the power of data.table and geos as you see fit.
(The same is true for regular data frames and tibbles, but we’ll get to that.)
Let’s prove that this idea works by creating a GEOS column in our data.table. I’ll creatively call this column `geo`, but really you could call it anything you want (including overwriting the existing `geometry` column).
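The one-liner itself, with setup repeated for completeness:

```r
library(sf)
library(data.table)
library(geos)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_dt = as.data.table(nc)

# Add a GEOS geometry list column alongside (or instead of) the sf one
nc_dt[, geo := as_geos_geometry(geometry)]
```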
GEOS column in hand, we can manipulate or plot it directly from within the data.table. For example, we can recreate our previous centroid plot.
And here’s how we could replicate our earlier “hemisphere” plot:
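A sketch of that replication (setup repeated for completeness; the `lon` split rule and the `nc_hemi` name are illustrative assumptions):

```r
library(sf)
library(data.table)
library(geos)

nc = st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
nc_dt = as.data.table(nc)
nc_dt[, geo := as_geos_geometry(geometry)]
nc_dt[, lon := st_coordinates(st_centroid(geometry))[, 1]]

# Collect each group into a single geometry collection, then union it
nc_hemi = nc_dt[
  ,
  .(geo = geos_unary_union(geos_make_collection(geo))),
  by = .(hemisphere = fifelse(lon < mean(lon), "west", "east"))
]
plot(nc_hemi$geo)  # base plot method for geos geometries
```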
This time around the translation from the equivalent sf code isn’t as direct. We have one step (`st_union()`) vs. two (`geos_make_collection() |> geos_unary_union()`). The second `geos_unary_union()` step is clear enough. But it’s the first `geos_make_collection()` step that’s key for our aggregating task. We have to tell geos to treat everything within the same group (i.e. whatever is in `by = ...`) as a collective. This extra step becomes very natural after you’ve done it a few times and is a small price to pay for the resulting performance boost.
Speaking of which, it’s nearly time for some final benchmarks. The only extra thing I want to do first is, as promised, include a tibble/dplyr equivalent. The exact same concepts and benefits carry over here, for those of you that prefer the tidyverse syntax and workflow.^{4}
For this final set of benchmarks, I’m going to horserace the same grouped aggregation that we’ve been using throughout.
Result: A 10x speed-up. Nice! While the toy dataset that we’re using here is too small to make a meaningful difference in practice, those same performance benefits will carry over to big geospatial tasks too. Being able to reduce your computation time by a factor of 10 really makes a difference once you’re talking minutes or hours.
My takeaways:
It’s fine to treat sf objects as data.tables (or vice versa) if that’s your preferred workflow. Just remember to specify the geometry column.
For large (or small!) geospatial tasks, give the geos package a go. It integrates very well with both data.table and the tidyverse, and the high-performance benefits carry over to both ecosystems.
By the way, there are more exciting high-performance geospatial developments on the way in R (as well as other languages) like geoarrow. We’re lucky to have these tools at our disposal.
Note: This section was added on 2021-01-16.
As Roger Bivand points out on Twitter, I’m not truly comparing apples with apples in the above benchmarks. geos assumes planar (“flat”) geometries, whereas sf does the more complicated task of calculating spherical (“curved”) geometries. More on that here if you are interested. Below I repeat these same benchmarks, but with sf switched to the same planar backend. The upshot is that geos is still faster, but the gap narrows considerably. A reminder that we’re also dealing with a very small dataset, so I recommend benchmarking on your own datasets to avoid the influence of misleading overhead. But I stand by my comment that these differences persist at scale, based on my own experiences and testing.
Use what you want, people. ↩
None of the actual functionality that I show here requires the dev version of data.table. But I recommend downloading it regardless, since v1.14.3 is set to introduce a bunch of other killer features. I might write up a list of my favourites once the new version hits CRAN. In the meantime, if any DT devs are reading this, please pretty please can we include these two PRs (1, 2) into the next release too. ↩
Yes, yes. I know you can include a (list) column of data frames within a data.table. But just bear with me for the moment. ↩
The important thing is that you explicitly convert it to a tibble. Leaving it as an sf object won’t yield the same speed benefits. ↩
I wanted to quickly follow up on my last post about efficient simulations in R. If you recall, in that post we used data.table and some other tricks to run 40,000 regressions (i.e. 20k simulations with 2 regressions each) in just over 2 seconds. The question before us today is: Can we go even faster using only base R? And it turns out that the answer is, yes, we can.
My motivation for a follow-up is partially the result of this very nice post by Benjamin Elbers, who replicates my simulation using Julia. In so doing, he demonstrates some of Julia’s killer features; most notably the fact that we don’t need to think about vectorisation — e.g. when creating the data — since Julia’s compiler will take care of that for us automatically.^{1} But Ben also does another interesting thing, which is to show the speed gains that come from defining our own (super lean) regression function. He uses Cholesky decomposition and it’s fairly straightforward to do the same thing in R. (Here is a nice tutorial by Issac Lee.)
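For reference, here is roughly what that lean Cholesky-based OLS looks like in base R. This is a sketch, not Ben’s exact implementation, and the `chol_lm` name is my own.

```r
# Lean OLS via Cholesky: solve X'X b = X'y, where X'X = R'R with R upper
# triangular. No standard errors, no residuals — coefficients only.
chol_lm = function(X, y) {
  R = chol(crossprod(X))  # R'R = X'X
  drop(backsolve(R, forwardsolve(t(R), crossprod(X, y))))
}

# Sanity check against lm()
set.seed(123)
X = cbind(1, rnorm(100))
y = 2 + 3 * X[, 2] + rnorm(100)
all.equal(chol_lm(X, y), unname(coef(lm(y ~ X[, 2]))))
#> TRUE
```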
I was halfway on my way to doing this myself when I stumbled on a totally different post by R Core member, Martin Maechler. Therein he introduces the `.lm.fit()` function (note the leading dot), which incurs even less overhead than the `lm.fit()` function I mentioned in my last post. I’m slightly embarrassed to say I had never heard about it until now^{2}, but a quick bit of testing “confirms” Martin’s more rigorous benchmarks: `.lm.fit()` yields a consistent 30-40% improvement over even `lm.fit()`.
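A quick check along those lines, confirming that the two functions return the same coefficients (timing left commented out since it needs the microbenchmark package, which is an assumption on my part):

```r
# Same coefficients, leaner return object
set.seed(42)
X = cbind(1, matrix(rnorm(2e4), ncol = 2))  # 10k rows: intercept + 2 regressors
y = rnorm(1e4)

all.equal(unname(lm.fit(X, y)$coefficients), .lm.fit(X, y)$coefficients)
#> TRUE

# microbenchmark::microbenchmark(lm.fit(X, y), .lm.fit(X, y))
```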
Now, it would be trivial to amend my previous simulation script to slot in `.lm.fit()` and re-run the benchmarks. But I thought I’d make this a bit more interesting by redoing the whole thing using only base R. (I’ll load the parallel package, but that comes bundled with the base distribution so hardly counts as cheating.) Here’s the full script with benchmarks for both sequential and parallel implementations at the bottom.
There you have it. Down to less than a second for a simulation involving 40,000 regressions using only base R.^{3} On a laptop, no less. Just incredibly impressive.
Conclusion: No grand conclusion today… except a sincere note of gratitude to the R Core team (and Julia devs and so many other OSS maintainers) for providing us with such an incredible base to build from.
P.S. Achim Zeileis (who else?) has another great tip for speeding up simulations where the experimental design is fixed here.
In this house, we stan both R and Julia. ↩
I think Dirk Eddelbuettel had mentioned it to me, but I hadn’t grokked the difference. ↩
Interestingly enough, this knitted R markdown version is a bit slower than when I run the script directly in my R console. But we’re really splitting hairs now. (As an aside: I won’t bother plotting the results, but you’re welcome to run the simulation yourself and confirm that it yields the same insights as my previous post.) ↩
Being able to code up efficient simulations is one of the most useful skills that you can develop as a social (data) scientist. Unfortunately, it’s also something that’s rarely taught in universities or textbooks.^{1} This post will cover some general principles that I’ve adopted for writing fast simulation code in R.
I should clarify that the type of simulations that I, personally, am most interested in are related to econometrics. For example, Monte Carlo experiments to better understand when a particular estimator or regression specification does well (or poorly). The guidelines here should be considered accordingly and might not map well on to other domains (e.g. agent-based models or numerical computation).
I’m going to illustrate by replicating a simulation result in a paper that I really like: “Interaction effects in econometrics” by Balli & Sørensen (2013) (hereafter, BS13).
BS13 does various things, but one result in particular has had a big impact on my own research. They show that empirical researchers working with panel data are well advised to demean any (continuous) variables that are going to be interacted in a regression. That is, rather than estimating the model in “level” terms…
\[Y_{it} = \mu_i + \beta_1X1_{it} + \beta_2X2_{it} + \beta_3X1_{it} \cdot X2_{it} + \epsilon_{it}\]

… you should estimate the “demeaned” version instead^{2}

\[Y_{it} = \beta_0 + \beta_1 (X1_{it} - \overline{X1}_{i.}) + \beta_2 (X2_{it} - \overline{X2}_{i.}) + \beta_3(X1_{it} - \overline{X1}_{i.}) \cdot (X2_{it} - \overline{X2}_{i.}) + \epsilon_{it}\]

Here, $\overline{X1}_{i.}$ refers to the mean value of variable $X1$ (e.g. GDP over time) for unit $i$ (e.g. country).
We’ll get to the simulations in a second, but BS13 describe the reasons for their recommendation in very intuitive terms. The super short version — again, you really should read the paper — is that the level model can pick up spurious trends in the case of varying slopes. The implications of this insight are fairly profound… if for no other reason than so many applied econometrics papers employ interaction terms in a panel setting.^{3}
Okay, so a potentially big deal. But let’s see a simulation and thereby get the ball rolling for this post. I’m going to run a simulation experiment that exactly mimics one in BS13 (see Table 3). We’ll create a fake dataset where the true interaction is ZERO. However, the slope coefficient of one of the parent terms varies by unit (here: country). If BS13 is right, then including an interaction term in our model could accidentally result in a spurious, non-zero coefficient on this interaction term. The exact model is
\[y_{it} = \alpha + x_{1,it} + 1.5x_{2,it} + \epsilon_{it}\]

It will prove convenient for me to create a function that generates an instance of the experimental dataset — i.e. corresponding to one simulation run — which is what you see in the code below. The exact details are not especially important. (I’m going to coerce the return object into a data.table instead of standard data frame, but I’ll get back to that later.) For now, just remember that the coefficient on any interaction term should be zero by design. I’ll preview the resulting dataset at the end of the code.
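A rough sketch of what such a generating function might look like. All parameter values, column names, and the exact varying-slope rule here are my own illustrative assumptions, not BS13’s.

```r
library(data.table)

# Hypothetical data-generating function: one call returns one simulation draw.
# The slope on x1 varies by unit ("id"), while the true interaction effect is
# zero by construction.
gen_data = function(sim = 1, n_ids = 2, n_t = 250) {
  dt = CJ(id = 1:n_ids, t = 1:n_t)  # balanced panel skeleton
  dt[, `:=`(x1 = rnorm(.N), x2 = rnorm(.N))]
  dt[, y := 1 + id * x1 + 1.5 * x2 + rnorm(.N)]  # varying slope via id * x1
  dt[, sim := sim]
  dt[]
}
head(gen_data())
```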
Let’s run some regressions on one simulated draw of our dataset. Since this is a panel model, I’ll use the (incredible) fixest package to control for country (“id”) fixed-effects.
Well, there you have it. The “level” model spuriously yields a statistically significant coefficient on the interaction term. In comparison, the “demeaned” version avoids this trap and also appears to have better estimated the parent term coefficients.
Cool. But to really be sure, we should repeat our simulation many times. (BS13 do it 20,000 times…) And, so, we now move on to the main purpose of this post: How do we write simulation code that efficiently completes tens of thousands of runs? Here follow some key principles that I try to keep in mind.
Subtitle: `lm.fit()` is your friend
The first key principle for writing efficient simulation code is to trim the fat as much as possible. Even small differences start to add up once you’re repeating operations tens of thousands of times. For example, does it really make sense to use `fixest::feols()` for this example data? As much as I am a huge fixest stan, in this case I have to say… no. The package is optimised for high-dimensional fixed-effects, clustered errors, etc. Our toy dataset contains just one fixed-effect (comprising two levels) and we are ultimately only interested in extracting a single coefficient for our simulation. We don’t even need to save the standard errors. Most of fixest’s extra features are essentially wasted. We could probably do better just by using a simple `lm()` call and specifying the country fixed-effect (“id”) as a factor.
However, `lm()` objects still contain quite a lot of information (and invoke extra steps) that we don’t need. We can simplify things even further by directly using the fitting function that `lm()` calls underneath the hood. Specifically, the `lm.fit()` function. This requires a slightly different way of writing our regression model — closer to matrix form — but yields considerable speed gains. Here’s a benchmark to demonstrate.
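A sketch of the `lm.fit()` pattern with a hand-built design matrix. The toy data below stands in for one simulated draw; the column names are illustrative assumptions.

```r
# Build the design matrix by hand, then call the fitting function directly
set.seed(1)
dat = data.frame(id = rep(1:2, each = 250), x1 = rnorm(500), x2 = rnorm(500))
dat$y = 1 + dat$id * dat$x1 + 1.5 * dat$x2 + rnorm(500)

X = model.matrix(~ x1 * x2 + factor(id), data = dat)
lm.fit(X, dat$y)$coefficients  # the interaction shows up as "x1:x2"
```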
For this small dataset example, a regular `lm()` call is about five times faster than `feols()`… and `lm.fit()` is a further ten times faster still. Now, we’re talking microseconds here and the difference is not something you’d notice running a single regression. But… once you start running 20,000 of them, then those microseconds start to add up.^{4} Final thing, just to prove that we’re getting the same coefficients:
The output is less visually appealing than a regular regression summary, but we can see the interaction term coefficient of `0.01993247` in the order in which it appeared (i.e. “x4”). FWIW, you can also name the coefficients in the design matrix if you wanted to make it easier to reference a coefficient by name. This is what I’ll be doing in the full simulation right at the end.
Subtitle: It’s much quicker to generate one large dataset than many small ones
One common bottleneck I see in a lot of simulation code is generating a small dataset for each new run of a simulation. This is much less efficient than generating a single large dataset that you can either sample from during each iteration, or subset by a dedicated simulation ID. We’ll get to iteration next, but this second principle really stems from the same core idea: vectorisation in R is much faster than iteration. Here’s a simple benchmark to illustrate, where we generate data for 100 simulation runs. Note that the relative difference would keep growing as we added more simulations.
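A stripped-down illustration of the same idea in base R (the actual benchmark used the simulation’s own generating function; this sketch just times raw data generation):

```r
# Same total amount of random data: 100 small draws vs. one vectorised call
n = 1e4
t_small = system.time(for (i in 1:100) rnorm(n))["elapsed"]
t_large = system.time(rnorm(100 * n))["elapsed"]
c(many_small = t_small, one_large = t_large)
```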
Subtitle: Let data.table and co. handle the heavy lifting
The standard approach to coding up a simulation is to run everything as an iteration, either using a `for()` loop or an `lapply()` call. Experienced R programmers are probably reading this section right now and thinking, “Even better; run everything in parallel.” And it’s true. A Monte Carlo experiment like the one we’re doing here is ideally suited to parallel implementation, because each individual simulation run is independent. It’s a key reason why Monte Carlo experiments are such popular tools for teaching parallel programming concepts. (Guilty as charged.)
But any type of explicit iteration — whether it is a `for()` loop or an `lapply()` call, or whether it is run sequentially or in parallel — runs up against the same problem as we saw in Principle 2. Specifically, it is slower than vectorisation. So how can we run our simulations in vectorised fashion? Well, it turns out there is a pretty simple way that directly leverages Principle 2’s idea of generating one large dataset: We nest our simulations directly in our large data.table or tibble.
Hadley and Garret’s R for Data Science book has a nice chapter on model nesting with tibbles, and then Vincent has a cool blog post replicating the same workflow with data.table. But, really, the core idea is pretty simple: We can use the advanced data structure and functionality of tibbles or data.tables to run our simulations as grouped operations (i.e. by simulation ID). In other words, just like we can group a data frame and then collapse down to (say) mean values, we can also group a data frame and then run a regression on each subgroup.
Why might this be faster than explicit parallel iteration? Well, basically it boils down to the fact that data.tables and tibbles provide an enhanced structure for returning complex objects (including list columns) and their grouped operations are highly optimised to run in (implicit) parallel at the C++ level.^{5} The internal code of data.table, in particular, is just so insanely optimised that trying to beat it with some explicit parallel loop can be a fool’s errand.
Okay, so let’s see a benchmark. I’m going to compare three options for simulating 100 draws: 1) sequential iteration with `lapply()`, 2) explicit parallel iteration with `parallel::mclapply()`, and 3) nested (implicit parallel) iteration. For the latter, I’m simply grouping my dataset by simulation ID and then leveraging data.table’s powerful `.SD` syntax.^{6} Note further that I’m going to run regular `lm()` calls rather than `lm.fit()` — see Principle 1 — because I want to keep things simple and familiar for the moment.
Okay, not a huge difference between the three options for this small benchmark. But — trust me — the difference will grow for the full simulation where we’re comparing the level vs demeaned regressions with `lm.fit()`. UPDATE: Upon reflection, I’m not being quite fair to `mclapply()` here, because it is being penalised for overhead on a small example. But I definitely stand by my next point. There are also some other reasons why relying on data.table will help us here. For example, `parallel::mclapply()` relies on forking, which is only available on Linux or Mac. Sure, you could use a different package like future.apply to provide a parallel backend (PSOCK) for Windows, but that’s going to be slower. Really, the bottom line is that we can outsource all of that parallel overhead to data.table and it will automatically handle everything at the C(++) level. Winning.
Subtitle: Save your simulation from having to do extra conversion work
The primary array format of empirical work is the data frame. It’s what we all use, really, so there’s no point expanding on that. (TL;DR data frames are just very convenient for humans to work with and reason about.) However, regressions are run on matrices. Which is to say that when you run a regression in R — and most other languages for that matter — behind the scenes your input data frame is first converted to an equivalent matrix before any computation gets done. Matrices have several features that make them “faster” to compute on than data frames. For example, every element must be of the same type (say, numeric). But let’s just agree that converting a data frame to a matrix requires at least some computational effort. Consider then what happens when we feed our `lm.fit()` function a pre-created design matrix, instead of asking it to convert a bunch of data frame columns on the fly.
We’re splitting hairs at this point. I mean, what’s 20 microseconds between friends? And, yet, these 20 microseconds translate to a roughly 40% improvement in relative terms. As I keep saying, even microseconds add up once you multiply them by a couple thousand.
“Okay, Grant.” I can already hear you saying. “You just told us to use data.tables and now you’re telling us to switch to matrices. Which is it, man?!” Well, remember what I said earlier about the enhanced structure that data.tables (and tibbles) offer us. We can easily create a list column of matrices inside a data.table (or tibble). We could have done this directly in the `gen_data()` function. But I’m going to leave that function as-is, and show you how simple it is to collapse columns of an existing data.table into a matrix list column. Once more we’ll use a standard grouped operation — where we are grouping by `sim` — to do the work:
I know the printed output looks a little different, but the key thing to know is that each simulation is now represented by a single row. In this case, we only have one simulation, so our whole data table consists of just one row. Moreover, those fancy list columns contain all of the 500 panel observations — in matrix form — that we need to run our regressions. To access whatever is inside one of the list columns, we “unnest” very simply by extracting the first element with brackets, i.e. `[[1]]`. For example, to extract the `Y` column of our single simulation dataset, we could do:
If you’d like to know more about this approach, then I highly recommend Vincent’s aforementioned blog post on the topic. The very last thing I’m going to show you here (since we’ll soon be adapting it to run our full simulation), is how easily everything carries over to operations inside a nested data table. In short, we just use the magic of `.SD` again:
Time to put everything together and run this thing. Like BS13, I’m going to simulate 20,000 runs. I’ll print the time it takes to complete the full simulation at the bottom.
And look at that. Just over 2 seconds to run the full 20k simulations! (Can you beat that? Let me know in the comments… UPDATE: Turns out you can, thanks to the even faster `.lm.fit()` function. See follow-up post here.)
All that hard work deserves a nice plot, don’t you think?
Here we have replicated the key result in BS13, Table 3. Moral of the story: If you have an interaction effect in a panel setting (e.g. DiD!), it’s always worth demeaning your terms and double-checking that your results don’t change.
Being able to write efficient simulation code is a very valuable skill. In this post we have replicated an actual published result, incorporating several principles that have served me well:
Trim the fat (Subtitle: `lm.fit()` is your friend.)
Generate your data once (Subtitle: It’s much quicker to generate one large dataset than many small ones)
Go parallel or nest (Subtitle: Let data.table and co. handle the heavy lifting)
Use matrices for an extra edge (Subtitle: Save your simulation from having to do extra conversion work)
You certainly don’t have to adopt all of these principles to write your own efficient simulation code in R. There may even be cases where it’s more efficient to do something else. But I’m confident that incorporating at least one or two of them will generally make your simulations much faster.
P.S. If you made it this far and still need convincing that simulations are awesome, watch John Rauser’s incredible talk, “Statistics Without The Agonizing Pain”.
Balli, Hatice Ozer, and Bent E. Sørensen. “Interaction effects in econometrics.” Empirical Economics 45, no. 1 (2013): 583-603. Link
Ed Rubin and I are writing a book that will attempt to fill this gap, among other things. Stay tuned! ↩
In their notation, BS13 only demean the interacted terms on $\beta_3$. But demeaning the parent terms on $\beta_1$ and $\beta_2$ is functionally equivalent and, as we shall see later, more convenient when writing the code since we can use R’s `*` expansion operator to concisely specify all of the terms. ↩
Got a difference-in-differences model that uses twoway fixed-effects? Ya, that’s just an interaction term in a panel setting. In fact, the demeaning point that BS13 are making here — and actually draw an explicit comparison to later in the paper — is equivalent to the argument that we should control for unit-specific time trends in DiD models. The paper includes additional simulations demonstrating this equivalence, but I don’t want to get sidetracked by that here. ↩
Another thing is that `lm.fit()` produces a much more limited, but leaner return object. We’ll be taxing our computer’s memory less as a result. ↩
That’s basically all that vectorisation is; i.e. a loop implemented at the C(++) level. ↩
This will closely mimic a related example in the data.table vignettes, which you should read if you’re interested to learn more. ↩
Consider the following scenario:
A researcher has to adjust the standard errors (SEs) for a regression model that she has already run. Maybe this is to appease a journal referee. Or, maybe it’s because she is busy iterating through the early stages of a project. She’s still getting to grips with her data and wants to understand how sensitive her results are to different modeling assumptions.
Does that sound familiar? I believe it should, because something like that has happened to me on every single one of my empirical projects. I end up estimating multiple versions of the same underlying regression model — even putting them side-by-side in a regression table, where the only difference across columns is slight tweaks to the way that the SEs were calculated.
Confronted by this task, I’m willing to bet that most people do the following:
While this is fine as far as it goes, I’m here to tell you that there’s a better way. Rather than re-running your model multiple times, I’m going to advocate that you run your model only once and then adjust SEs on the backend as needed. This approach — what I’ll call “on-the-fly” SE adjustment — is not only safer, it’s much faster too.
Let’s see some examples.
UPDATE (2021-06-21): You can now automate all of the steps that I show below with a single line of code in the new version(s) of modelsummary. See here.
To the best of my knowledge, on-the-fly SE adjustment was introduced to R by the sandwich package (@Achim Zeileis et al.) This package has been around for well over a decade and is incredibly versatile, providing an object-orientated framework for recomputing variance-covariance (VCOV) matrix estimators — and thus SEs — for a wide array of model objects and classes. At the same time, sandwich just recently got its own website to coincide with some cool new features. So it’s worth exploring what that means for a modern empirical workflow. In the code that follows, I’m going to borrow liberally from the introductory vignette. But I’ll also tack on some additional tips and tricks that I use in my own workflow. (UPDATE (2020-08-23): The vignette has now been updated to include some of the suggestions from this post. Thanks Achim!)
Let’s start by running a simple linear regression on some sample data; namely, the “PetersenCL” dataset that comes bundled with the package.
Our simple model above assumes that the errors are iid. But we can adjust these SEs by calling one of the many alternate VCOV estimators provided by sandwich. For example, to get a robust, or heteroscedasticity-consistent (“HC3”), VCOV matrix we’d use:
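A sketch of that call, with the base model re-run for completeness:

```r
library(sandwich)

data("PetersenCL", package = "sandwich")
m = lm(y ~ x, data = PetersenCL)

# Heteroscedasticity-consistent (HC3) variance-covariance matrix
vcovHC(m, type = "HC3")
```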
To actually substitute the robust VCOV into our original model — so that we can print it in a nice regression table and perform statistical inference — we pair sandwich with its companion package, lmtest. The workhorse function here is lmtest::coeftest
and, as we can see, this yields an object that is similar to a standard model summary in R.
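Sketching the pairing (again with the self-contained Petersen setup):

```r
library(sandwich)
library(lmtest)
data("PetersenCL", package = "sandwich")
m <- lm(y ~ x, data = PetersenCL)

# Substitute the robust (HC3) VCOV into a summary-style coefficient table
coeftest(m, vcov = vcovHC)
```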
To recap: We ran our base m
model just the once and then adjusted for robust SEs on the backend using sandwich/coeftest.
Now, I’ll admit that the benefits of this workflow aren’t super clear from my simple example yet. Still, we did cut down on the copying-and-pasting of duplicate code, which automatically helps to minimize user error. (Remember: DRY!) But we can easily scale things up to get a better sense of its power. For instance, we could imagine applying a whole host of alternate VCOVs to our base model.
You could, of course, print the vc
list to screen now if you so wanted. But I want to go one small step further by showing you how easy it is to create a regression table that encapsulates all of these different models. In the next code chunk, I’m going to create a list of models by passing vc
to an lapply()
call.^{1} I’m then going to generate a regression table using msummary()
from the excellent modelsummary package (@Vincent Arel-Bundock).
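Sketching that workflow (the VCOV list here is a truncated, hypothetical version of the nine columns in the table; `firm` is the cluster variable in PetersenCL):

```r
library(sandwich)
library(lmtest)
data("PetersenCL", package = "sandwich")
m <- lm(y ~ x, data = PetersenCL)

# A truncated list of alternate VCOV matrices for the same fitted model
vc <- list(
  "Standard"  = vcov(m),
  "HC3"       = vcovHC(m),
  "Clustered" = vcovCL(m, cluster = ~ firm)
)

# One coeftest() per VCOV, all re-using the single model fit
mods <- lapply(vc, function(v) coeftest(m, vcov = v))

# modelsummary::msummary(mods)  ## would then render the regression table
```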
|             | Standard | Sandwich (basic) | Clustered | Clustered (two-way) | HC3 | Andrews' kernel HAC | Newey-West | Bootstrap | Bootstrap (clustered) |
|-------------|----------|------------------|-----------|---------------------|-----|---------------------|------------|-----------|-----------------------|
| (Intercept) | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 | 0.030 |
|             | (0.028) | (0.028) | (0.067) | (0.065) | (0.028) | (0.044) | (0.066) | (0.028) | (0.061) |
| x           | 1.035 | 1.035 | 1.035 | 1.035 | 1.035 | 1.035 | 1.035 | 1.035 | 1.035 |
|             | (0.029) | (0.028) | (0.051) | (0.054) | (0.028) | (0.035) | (0.048) | (0.029) | (0.052) |
| Num.Obs.    | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 | 5000 |
| AIC         | 21151.2 | 21151.2 | 21151.2 | 21151.2 | 21151.2 | 21151.2 | 21151.2 | 21151.2 | 21151.2 |
| BIC         | 21170.8 | 21170.8 | 21170.8 | 21170.8 | 21170.8 | 21170.8 | 21170.8 | 21170.8 | 21170.8 |
| Log.Lik.    | -10572.604 | -10572.604 | -10572.604 | -10572.604 | -10572.604 | -10572.604 | -10572.604 | -10572.604 | -10572.604 |
If you’re the type of person — like me — who prefers a visual representation, then producing a coefficient plot is equally easy with modelsummary::modelplot()
. This creates a ggplot2 object that can be further manipulated as needed. In the code chunk below, I’ll demonstrate this fairly simply by flipping the plot orientation.
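As a minimal, self-contained sketch (re-fitting the Petersen model and comparing just two of the SE variants for brevity):

```r
library(sandwich)
library(lmtest)
library(modelsummary)
library(ggplot2)
data("PetersenCL", package = "sandwich")
m <- lm(y ~ x, data = PetersenCL)

mods <- list(
  "Standard" = coeftest(m),
  "HC3"      = coeftest(m, vcov = vcovHC)
)

# modelplot() returns a ggplot2 object, so we can keep manipulating it,
# e.g. by flipping the plot orientation
modelplot(mods) + coord_flip()
```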
And there you have it: An intuitive and helpful comparison across a host of specifications, even though we only “ran” the underlying model once. Simple!
While sandwich covers a wide range of model classes in R, it’s important to know that a number of libraries provide their own specialised methods for on-the-fly SE adjustment. The one that I want to show you for this second example is the fixest package (@Laurent Bergé).
If you follow me on Twitter or have read my lecture notes, you already know that I am a huge fan of this package. It’s very elegantly designed and provides an insanely fast way to estimate high-dimensional fixed effects models. More importantly for today’s post, fixest offers automatic support for on-the-fly SE adjustment. We only need to run our model once and can then adjust the SEs on the backend via a call to either summary(..., se = 'se_type')
or summary(..., cluster = c('cluster_vars'))
.^{2}
To demonstrate, I’m going to run some regressions on a subsample of the well-known RITA air traffic data. I’ve already downloaded the dataset from Revolution Analytics and prepped it for the narrow purposes of this blog post. (See the data appendix below for code.) All told we’re looking at 9 variables extending over approximately 1.8 million rows. So, not “big” data by any stretch of the imagination, but my regressions should take at least a few seconds to run on most computers.
The actual regression that I’m going to run on these data is somewhat uninspired: Namely, how does arrival delay depend on departure delay, conditional on the time of day?^{3} I’ll throw in a bunch of fixed effects to make the computation a bit more interesting/intensive, but it’s fairly standard stuff. Note that I am running a linear fixed effect model by calling fixest::feols()
.
But, really, I don’t want you to get sidetracked by the regression details. The main thing I want to focus your attention on is the fact that I’m only going to run the base model once, i.e. for mod1
. Then, I’m going to adjust the SE for two more models, mod2
and mod3
, on the fly via respective summary()
calls.
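The air-traffic regressions can’t be reproduced here without the dataset, so here is a minimal sketch of the same pattern using mtcars. (The formula and fixed effects are purely illustrative, not the model from the post.)

```r
library(fixest)

# Run the (fixed-effects) model once...
mod1 <- feols(mpg ~ wt | cyl + gear, data = mtcars)

# ...then adjust the SEs on the backend, without re-estimating anything
mod2 <- summary(mod1, se = "hetero")    # heteroskedasticity-robust
mod3 <- summary(mod1, cluster = ~ cyl)  # clustered by cyl
```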
Before I get to benchmarking, how about a quick coefficient plot? I’ll use modelsummary::modelplot()
again, focusing only on the key “time of day × departure delay” interaction terms.
Great, it worked. But did it save time? To answer this question I’ve benchmarked against three other methods:

1. `feols()`, again from the fixest package, but this time with each of the three models run separately.
2. `felm()` from the lfe package.
3. `reghdfe` from the reghdfe package (Stata).

Note: I’m benchmarking against lfe and reghdfe because these two excellent packages have long set the standard for estimating high-dimensional fixed effects models in the social sciences. In other words, I’m trying to convince you of the benefits of on-the-fly SE adjustment by benchmarking against the methods that most people use.
You can find the benchmarking code for these other methods in the appendix. (Please let me know if you spot any errors.) In the interests of brevity, here are the results.
There are several takeaways from this exercise. For example, fixest::feols()
is the fastest method even if you are (inefficiently) re-running the models separately. But — yet again and once more, dear friends — the key thing that I want to emphasise is the additional time savings brought on by adjusting the SEs on the fly. Indeed, we can see that the on-the-fly feols
approach only takes a third of the time (approximately) that it does to run the models separately. This means that fixest is recomputing the SEs for models 2 and 3 pretty much instantaneously.
To add one last thing about benchmarking: the absolute difference in model run times was not that huge for this particular exercise. There’s maybe two minutes separating the fastest and slowest methods. (Then again, that’s not trivial either…) But if you are like me and find yourself estimating models where each run takes many minutes, hours, or even days and weeks, then the time savings quickly become enormous.
There comes a point in almost every empirical project where you have to estimate multiple versions of the same model. Which is to say, the only difference between these multiple versions is how the standard errors were calculated: robust, clustered, etc. Maybe you’re trying to satisfy a referee request before publication. Or, maybe you’re trying to understand how sensitive your results are to different modeling assumptions.
The goal of this blog post has been to show you that there is often a better approach than manually re-running multiple iterations of your model. Instead, I advocate that you run the model once and then adjust your standard errors on the backend, as needed. This “on-the-fly” approach will save you a ton of time if you are working with big datasets. Even if you aren’t working with big data, you will minimize copying and pasting of duplicate code. All of which will help to make your code more readable and cut down on potential errors.
What’s not to like?
P.S. There are a couple of other R libraries with support for on-the-fly SE adjustment, e.g. clubSandwich. Since I’ve used it as a counterfoil in the benchmark, I should add that lfe::felm()
provides its own method for swapping out different SEs post-estimation; see here. Similarly, I’ve focused on R because it’s the software language that I use most often and — as far as I am aware — is the only one to provide methods for on-the-fly SE adjustment across a wide range of models. If anyone knows of equivalent methods or canned routines in other languages, please let me know in the comments.
P.P.S. Look, I’m not saying it’s necessarily “wrong” to specify your SEs in the model call. Particularly if you’ve already settled on a single VCOV to use. Then, by all means, use the convenience of Stata’s , robust
syntax or the R equivalent lm_robust()
(via the estimatr package).
clear
clear matrix
timer clear
set more off
cd "Z:\home\grant"
import delimited air.csv
// Encode strings as numeric factors (Stata struggles with the former)
encode tail_num, generate(tail_num2)
encode dep_tod, generate(dep_tod2)
// Start timer and run regs
timer on 1
qui reghdfe arr_delay i.dep_tod2 i.dep_tod2#c.dep_delay, ///
absorb(month year day_of_week origin_airport_id dest_airport_id) ///
cluster(month)
qui reghdfe arr_delay i.dep_tod2 i.dep_tod2#c.dep_delay, ///
absorb(month year day_of_week origin_airport_id dest_airport_id) ///
cluster(month origin_airport_id)
qui reghdfe arr_delay i.dep_tod2 i.dep_tod2#c.dep_delay, ///
absorb(month year day_of_week origin_airport_id dest_airport_id) ///
cluster(month origin_airport_id tail_num2)
timer off 1
// Export time
drop _all
gen elapsed = .
set obs 1
replace elapsed = r(t1) if _n == 1
outsheet using "reghdfe-ex.csv", replace
You can substitute with a regular for loop or purrr::map()
if you prefer. ↩
You should read the package documentation for a full description, but very briefly: Valid se
arguments are “standard”, “hetero”, “cluster”, “twoway”, “threeway” or “fourway”. The cluster
argument provides an alternative way to be explicit about which variables you want to cluster on. E.g. You would write cluster = c('country', 'year')
instead of se = 'twoway'
. ↩
Note that I’m going to use a dep_tod / dep_delay
expansion on the RHS to get the full marginal effect of the interaction terms. Don’t worry too much about this if you haven’t seen it before (click on the previous link if you want to learn more). ↩
So, alongside the main methods from last time…

- `data.table::melt` and `tidyr::pivot_longer` (R)
- `reshape`, `sreshape`, and `greshape` (gtools) (Stata)

… the additional benchmarks that we’ll be considering today are:

- `base::reshape()` (R)
- `greshape` with the “dropmiss” and “nochecks” arguments added (Stata)
- `pandas.melt` (Python)
- `stack` (Julia, via DataFrames)
I’ll divide the results into two sections.
Our first task will be to reshape the same (sparse) 1,000 by 1,002 dataset from wide to long. Here are the results and I’ll remind you that the x-axis has been log-transformed to handle scaling.
Once more, we see that data.table rules the roost. However, the newly-added DataFrames (Julia) and pandas (Python) implementations certainly put in a good shout, coming in second and third, respectively. Interestingly enough, my two tidyr benchmarks seem to have shuffled slightly this time around, but that’s only to be expected for very quick operations like this. (We’ll test again in a moment on a larger dataset.) Adding options to gtools yields a fairly modest if noticeable difference, while the base R reshape()
command doesn’t totally disgrace itself. It’s certainly much faster than the Stata equivalent.
Another thing to ponder is whether the results are sensitive to the relatively small size of the test data. The long-form dataset is “only” 1 million rows deep and the fastest methods complete in only a few milliseconds. So, for this next set of benchmarks, I’ve scaled up the data by two orders of magnitude: Now we want to reshape a 100,000 by 1,002 dataset from wide to long. In other words, the resulting long-form dataset is 100 million rows deep.^{2}
Without further ado, here are the results. Note that I’m dropping the slowest methods (because I’m not a masochist) and this also means that I won’t need to log-transform the x-axis anymore.
Reassuringly, everything stays pretty much the same from a rankings perspective. The ratios between the different methods are very close to the small data benchmarks. The most notable thing is that gtools manages to claw back time (suggesting some initial overhead penalty), although it still lags the other methods. For reference, the default data.table melt()
method completes in just over a second on my laptop, which is just crazy fast. All of the methods here are impressively quick, to be honest.
Summarizing, here is each language represented by its fastest method.
See my previous post for the data generation and plotting code. (Remember to set n = 1e8
for the large data benchmark.) For the sake of brevity, here is a quick recap of the main reshaping functions that I use across the different languages and how I record timing.
# Libraries ---------------------------------------------------------------
library(tidyverse)
library(data.table)
library(microbenchmark)
# Data --------------------------------------------------------------------
d = fread('sparse-wide.csv')
# Base --------------------------------------------------------------------
base_reshape = function() reshape(d, direction='long', varying=3:1002, sep="")
# tidyverse ---------------------------------------------------------------
## Default
tidy_pivot = function() pivot_longer(d, -c(id, grp))
## Default with na.rm argument
tidy_pivot_narm = function() pivot_longer(d, -c(id, grp), values_drop_na = TRUE)
# data.table --------------------------------------------------------------
DT = as.data.table(d)
## Default
dt_melt = function() melt(DT, id.vars = c('id', 'grp'))
## Default with na.rm argument
dt_melt_narm = function() melt(DT, id.vars = c('id', 'grp'), na.rm = TRUE)
# Benchmark ---------------------------------------------------------------
b = microbenchmark(base_reshape(),
tidy_pivot(), tidy_pivot_narm(),
dt_melt(), dt_melt_narm(),
times = 5)
clear
clear matrix
timer clear
set more off
cd "Z:\home\grant\Documents\Projects\reshape-benchmarks"
import delimited "sparse-wide.csv"
// Vanilla Stata
preserve
timer on 1
reshape long x, i(id grp) j(variable)
timer off 1
restore
// sreshape
preserve
timer on 2
sreshape long x, i(id grp) j(variable) missing(drop all)
timer off 2
restore
// gtools
preserve
timer on 3
greshape long x, by(id grp) key(variable)
timer off 3
restore
// gtools (dropmiss)
preserve
timer on 4
greshape long x, by(id grp) key(variable) dropmiss
timer off 4
restore
// gtools (nochecks)
preserve
timer on 5
greshape long x, by(id grp) key(variable) dropmiss nochecks
timer off 5
restore
timer list
drop _all
gen result = .
set obs 5
timer list
forval j = 1/5{
replace result = r(t`j') if _n == `j'
}
outsheet using "reshape-results-stata.csv", replace
import pandas as pd
import numpy as np
df = pd.read_csv('sparse-wide.csv')
result = %timeit -o df.melt(id_vars=['id', 'grp'])
result_df = pd.DataFrame({'result':[np.median(result.timings)]})
result_df.to_csv('reshape-results-python.csv')
using CSV, DataFrames, BenchmarkTools
d = DataFrame(CSV.File("sparse-wide.csv"))
jl_stack = @benchmark stack(d, Not([:id, :grp])) evals=5
CSV.write("reshape-results-julia.csv", DataFrame(result = median(jl_stack)))
Namely: A manual split-apply-combine reshaping approach doesn’t yield the same kind of benefits in R as it does in Stata. You’re much better off sticking to the already-optimised defaults. ↩
Let the record show that I tried running one additional order of magnitude (i.e. a billion rows), but data.table was the only method that reliably completed its benchmark runs without completely swamping my memory (32 GB) and crashing everything. As I said last time, it truly is a marvel for big data work. ↩
Over on Twitter, I was reply-tagged in a tweet thread by Ryan Hill. Ryan shows how he overcomes a problem that arises when reshaping a sparse (i.e. unbalanced) dataset in Stata. Namely, how can you cut down on the computation time that Stata wastes with all the missing values, especially when reshaping really wide data? Ryan’s clever solution is very much in the split-apply-combine mould. Manually split the data into like groups (i.e. sharing the same columns), drop any missing observations, and then reshape on those before recombining everything at the end. It turns out that this is a lot faster than Stata’s default reshape command… and there is even a package (sreshape) that implements this for you.
So far so good. But I was asked what the R equivalent of this approach would be. It’s pretty easy to implement — more on that in a moment — but I expressed scepticism that it would yield the same kind of benefits as the Stata case. There are various reasons for my scepticism, including the fact that R’s reshaping libraries are already highly optimised for this kind of thing and R generally does a better job of handling missing values.^{1}
Sounds like we need a good reshape horserace up in here!
Insert obligatory joke about time spent reading reshape help files.
Similar to Ryan, our task will be to reshape wide data (1000 non-index columns) with a lot of missing observations. I’ll leave my scripts at the bottom of this post, but first a comparison of the “default” reshaping methods. For Stata, that includes the vanilla reshape
command and the aforementioned sreshape
command, as well as greshape
from gtools. For R, we’ll use pivot_longer()
from the tidyverse (i.e. tidyr) and melt()
from data.table. Note the log scale and the fact that I’ve rebased everything relative to the fastest option.
Unsurprisingly, data.table::melt()
is easily the fastest method. However, tidyr::pivot_longer()
gives a very decent account of itself and is about three times as fast as gtools’ greshape
. The base Stata reshape
option is hopelessly slow for this task, demonstrating (among other things) the difficulty it has with missing values.
Defaults out of the way, let’s implement the manual split-apply-combine approach in R. Again, I’ll leave my scripts at the bottom of the post for you to look at, but I’m essentially just following (variants of) the approach that Ryan adroitly lays out. Note that both tidyr::pivot_longer()
and data.table::melt()
provide options to drop missing values, so I’m going to try those out too.
As expected, the manual split-apply-combine approach(es) don’t yield any benefits in the R case. In fact, quite the opposite: they result in a rather sizeable performance loss. (Yes, I know that I could try running things in parallel, but I can already tell you that the extra overhead won’t be worth it for this particular example.)
For reshaping sparse data, you can’t really do much better than sticking with the defaults in R. data.table remains a speed marvel, although tidyr gives a very good account of itself too. Stata users should definitely switch to gtools if they aren’t using it already.
Update: Follow-up post here with additional benchmarks, including other SW languages and a larger dataset.
As promised, here is the code. Please let me know if you spot any errors.
First, generate the dataset (in R).
# Libraries ---------------------------------------------------------------
library(tidyverse)
library(data.table)
# Data prep ---------------------------------------------------------------
set.seed(10)
n = 1e6
n_col=1e3
d = matrix(sample(LETTERS, n, replace=TRUE), ncol=n_col)
## Randomly replace columns with NA values
for(i in 1:nrow(d)) {
j = sample(2:n_col, 1)
d[i, j:n_col] = NA_character_
}
rm(i, j)
## Ensure at least one row has obs for all columns
d[1, ] = sample(LETTERS, n_col, replace = TRUE)
## Get non-missing obs group (only really needed for the manual split-apply-combine approaches)
grp = apply(d, 1, function(x) sum(!is.na(x)))
## Convert to data frame and name columns
d = as.data.frame(d)
colnames(d) = paste0("x", seq_along(d))
d$grp = grp
d$id = row.names(d)
d = d %>% select(id, grp, everything())
# Export -----------------------------------------------------------------
fwrite(d, '~/sparse-wide.csv')
Next, run the Stata benchmarks.
clear
clear matrix
timer clear
set more off
import delimited "~/sparse-wide.csv"
// Vanilla Stata
preserve
timer on 1
reshape long x, i(id grp) j(variable)
timer off 1
restore
// sreshape
// net install dm0090
preserve
timer on 2
sreshape long x, i(id grp) j(variable) missing(drop all)
timer off 2
restore
// gtools
// ssc install gtools
preserve
timer on 3
greshape long x, by(id grp) key(variable)
timer off 3
restore
timer list
drop _all
gen result = .
set obs 3
timer list
forval j = 1/3{
replace result = r(t`j') if _n == `j'
}
outsheet using "~/sparse-reshape-stata.csv", replace
Finally, let’s run the R benchmarks and compare.
# Libraries ---------------------------------------------------------------
library(tidyverse)
library(data.table)
library(microbenchmark)
library(hrbrthemes)
theme_set(theme_ipsum())
# tidyverse ---------------------------------------------------------------
## Default
tidy_pivot = function() pivot_longer(d, -c(id, grp))
## Default with na.rm argument
tidy_pivot_narm = function() pivot_longer(d, -c(id, grp), values_drop_na = TRUE)
## Manual split-apply-combine approach
tidy_split = function() map_dfr(unique(d$grp), function(i) pivot_longer(filter(d, grp==i)[1:(i+2)], -c(id, grp)))
## Version of manual split-apply-combine approach that uses nesting
tidy_nest = function() {
d %>%
group_nest(grp) %>%
mutate(data = map2(data, grp, ~ select(.x, 1:(.y+1)))) %>%
mutate(data = map(data, ~ pivot_longer(.x, -id))) %>%
unnest(cols = data)
}
# data.table --------------------------------------------------------------
DT = as.data.table(d)
## Default
dt_melt = function() melt(DT, id.vars = c('id', 'grp'))
## Default with na.rm argument
dt_melt_narm = function() melt(DT, id.vars = c('id', 'grp'), na.rm = TRUE)
## Manual split-apply-combine approach
dt_split = function() rbindlist(lapply(unique(DT$grp), function(i) melt(DT[grp==i, 1:(i+2)], id.vars=c('id','grp'))))
# Benchmark ---------------------------------------------------------------
b = microbenchmark(tidy_pivot(), tidy_pivot_narm(), tidy_split(), tidy_nest(),
dt_melt(), dt_melt_narm(), dt_split(),
times = 5)
b
autoplot(b)
# Comparison with Stata results -------------------------------------------
stata = fread('~/sparse-reshape-stata.csv')
stata$method = c('reshape', 'sreshape', 'gtools')
stata$sw = 'Stata'
r = data.table(result = print(b, 's')$median, ## just take the median time
method = gsub('\\(\\)', '', print(b)$expr),
sw = 'R'
)
res = rbind(r, stata)
res[, rel_speed := result/min(result)]
capn = paste0('Task: Wide to long reshaping of an unbalanced (sparse) ', nrow(d),
' × ', ncol(d), ' data frame with two ID variables.')
## Defaults only
ggplot(res[method %chin% c('dt_melt', 'tidy_pivot', 'gtools', 'sreshape', 'reshape')],
aes(x = rel_speed, y = fct_reorder(method, rel_speed), col = sw, fill = sw)) +
geom_col() +
scale_x_log10() +
scale_color_brewer(palette = 'Set1') + scale_fill_brewer(palette = 'Set1') +
labs(x = 'Time (relative to fastest method)', y = 'Method', title = 'Reshape benchmark',
subtitle = 'Default methods only',
caption = capn)
## All
ggplot(res, aes(x = rel_speed, y = fct_reorder(method, rel_speed), col = sw, fill = sw)) +
geom_col() +
scale_x_log10() +
labs(x = 'Time (relative to fastest method)', y = 'Method', title = 'Reshape benchmark',
caption = capn) +
scale_color_brewer(palette = 'Set1') + scale_fill_brewer(palette = 'Set1')
As good as Stata is at handling rectangular data, it’s somewhat notorious for how it handles missing observations. But that’s a subject for another day. ↩
I recently tweeted one of my favourite R tricks for getting the full marginal effect(s) of interaction terms. The short version is that, instead of writing your model as lm(y ~ f1 * x2)
, you write it as lm(y ~ f1 / x2)
. Here’s an example using everyone’s favourite mtcars
dataset.
First, partial marginal effects with the standard f1 * x2
interaction syntax.
Second, full marginal effects with the trick f1 / x2
interaction syntax.
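The two original code chunks aren’t shown here, but they amount to something like the following sketch (object names are mine):

```r
# Partial marginal effects: standard crossed interaction
fit_cross <- lm(mpg ~ factor(am) * wt, data = mtcars)
coef(fit_cross)

# Full marginal effects: nested interaction trick
fit_nest <- lm(mpg ~ factor(am) / wt, data = mtcars)
coef(fit_nest)
```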
To get the full marginal effect of factor(am)1:wt
in the first case, I have to manually sum up the coefficients on the constituent parts (i.e. wt=-3.7859
+ factor(am)1:wt=-5.2984
). In the second case, I get the full marginal effect of −9.0843 immediately in the model summary. Not only that, but the correct standard errors, p-values, etc. are also automatically calculated for me. (If you don’t remember, manually calculating SEs for multiplicative interaction terms is a huge pain. And that’s before we even get to complications like standard error clustering.)
Note that the lm(y ~ f1 / x2)
syntax is actually shorthand for the more verbose lm(y ~ f1 + f1:x2)
. I’ll get back to this point further below, but I wanted to flag the expanded syntax as important because it demonstrates why this trick “works”. The key idea is to drop the continuous variable parent term (here: x2
) from the regression. This forces all of the remaining child terms to be expressed relative to the same base. It’s also why this trick can easily be adapted to, say, Julia or Stata (see here).
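A quick, self-contained way to convince yourself of this equivalence on mtcars:

```r
f_short   <- lm(mpg ~ factor(am) / wt, data = mtcars)
f_verbose <- lm(mpg ~ factor(am) + factor(am):wt, data = mtcars)

# Identical coefficients (and coefficient names), since "/" is just shorthand
all.equal(coef(f_short), coef(f_verbose))
#> [1] TRUE
```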
So far, so good. It’s a great trick that has saved me a bunch of time (to say nothing of likely user error), and one that I recommend to everyone. Yet, I was prompted to write a separate blog post after being asked whether this trick a) works for higher-order interactions, and b) carries over to other non-linear models like logit? The answer in both cases is a happy “Yes!”.
Let’s consider a threeway interaction, since this will demonstrate the general principle for higher-order interactions. First, as a matter of convenience, I’ll create a new dataset so that I don’t have to keep specifying the factor variables.
Now, we run a threeway interaction and view the (naive, partial) marginal effects.
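Those two steps aren’t reproduced in this excerpt, but a minimal sketch might be (using the name `fit1`, which the post refers to later; the dataset name `mtcars2` is my own):

```r
# New dataset with vs and am pre-converted to factors
mtcars2 <- transform(mtcars, vs = factor(vs), am = factor(am))

# Threeway (crossed) interaction: naive, partial marginal effects
fit1 <- lm(mpg ~ vs * am * wt, data = mtcars2)
summary(fit1)
```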
Say we are interested in the full marginal effect of the threeway interaction vs1:am1:wt
. Even summing the correct parent coefficients is a potentially error-prone process, requiring us to think through the underlying math (which terms are excluded from the partial derivative, etc.). And don’t even get me started on the standard errors…
Now, it should be said that there are several existing tools for obtaining this number that don’t require us working through everything by hand. Here I’ll use my favourite such tool — the margins package — to save me the mental arithmetic.
We now at least see that the full (average) marginal effect is −7.7676. Still, while this approach works well in the present example, we can also begin to see some downsides. It requires extra coding steps and comes with its own specialised syntax. Moreover, underneath the hood, margins relies on a numerical delta method that can dramatically increase computation time and memory use for even moderately sized real-world problems. (Is your dataset bigger than 1 GB? Good luck.) Another practical problem is that margins may not even support your model class. (See here.)
So, what about the alternative? Does our little syntax trick work here too? The good news is that, yes, it’s just as simple as it was before.
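A minimal sketch of the nested version (object names mine):

```r
mtcars2 <- transform(mtcars, vs = factor(vs), am = factor(am))

# Nested threeway interaction: the vs1:am1:wt coefficient is now the
# *full* marginal effect, directly in the model summary
fit2 <- lm(mpg ~ vs / am / wt, data = mtcars2)
summary(fit2)
```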
Again, we get the full marginal effect of −7.7676 (and correct SE of 2.2903) directly in the model object. Much easier, isn’t it?
Where this approach really shines is in combination with plotting. Say, after extracting the coefficients with broom::tidy()
, or simply plotting them directly with modelsummary::modelplot()
. Model results are usually much easier to interpret visually, but this is precisely where we want to depict full marginal effects to our reader. Here I’ll use the modelsummary package to produce a nice coefficient plot of our threeway interaction terms.
The above plot immediately makes clear how automatic transmission exacerbates the impact of vehicle weight on MPG. We also see that the conditional impact of engine shape is more ambiguous. In contrast, I invite you to produce an equivalent plot using our earlier fit1
object and see if you can easily make sense of it. (I certainly can’t.)
On the subject of speed, recall that the lm(y ~ f1 / x2)
syntax is equivalent to the more verbose lm(y ~ f1 + f1:x2)
. This verbose syntax provides a clue for greatly reducing computation time for large models, particularly those with factor variables that contain many levels. We simply need to specify the parent factor terms as fixed effects (using a specialised library like fixest). Going back to our introductory twoway interaction example, you would thus write the model as follows.
(I’ll let you confirm for yourself that running the above models gives the correct −9.0843 figure from before.)
In case you’re wondering, the verbose equivalent for the f1 / f2 / x3
threeway interaction is f1 + f2 + f1:f2 + f1:f2:x3
. So we can use the same FE approach for this more complicated case as follows.^{1}
There’s our desired −7.7676 coefficient again. This time, however, we also get the added bonus of clustered standard errors (which are switched on by default in fixest::feols()
models).
Caveat: The above example implicitly presumes that you don’t care about doing inference on the parent term(s), since these are swept away by the underlying fixed-effect procedures. That is clearly not going to be desirable in every case. But, in practice, I often find that it is a perfectly acceptable trade-off for models that I am running. (For example, when I am trying to remove general calendar artefacts like monthly effects.)
The last thing I want to demonstrate quickly is that our little trick carries over neatly to other model classes too. Say, that old workhorse of non-linear stats hot! new! machine! learning! classifier: logit models. Again, I’ll let you run these to confirm for yourself:
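Those chunks aren’t shown in this excerpt, but a minimal stand-in might look like this. (The outcome and regressors are my own choice for illustration; glm may warn about fitted probabilities of 0 or 1 on such a small dataset.)

```r
# Partial vs. "full" versions of a logit model, using the same syntax trick
logit_cross <- glm(am ~ factor(vs) * wt, family = binomial, data = mtcars)
logit_nest  <- glm(am ~ factor(vs) / wt, family = binomial, data = mtcars)

coef(logit_cross)
coef(logit_nest)
```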
Okay, I confess: That last code chunk was a trick to see who was staying awake during statistics class. I mean, it will correctly sum the coefficient values. But we all know that the raw coefficient values on generalised linear models like logit cannot be interpreted as marginal effects, regardless of whether there are interactions or not. Instead, we need to convert them via an appropriate link function. In R, the mfx package will do this for us automatically. My real point, then, is to say that we can combine the link function (via mfx) and our syntax trick (in the case of interaction terms). This makes a surprisingly complicated problem much easier to handle.
We don’t always want the full marginal effect of an interaction term. Indeed, there are times where we are specifically interested in evaluating the partial marginal effect. (In a difference-in-differences model, for example.) But in many other cases, the full marginal effect of the interaction terms is exactly what we want. The lm(y ~ f1 / x2)
syntax trick (and its equivalents) is a really useful shortcut to remember in these cases.
PS. In case I didn’t make it clear: This trick works best when your interaction contains at most one continuous variable. (This is the parent “x” term that gets left out in all of the above examples.) You can still use it when you have more than one continuous variable, but it will implicitly force one of them to zero. Factor variables, on the other hand, get forced relative to the same base (here: the intercept), which is what we want.
Update. Subsequent to posting this, I was made aware of this nice SO answer by Heather Turner, which treads similar ground. I particularly like the definitional contrast between factors that are “crossed” versus those that are “nested”.
For the fixest::feols
case, we don’t have to specify all of the parent terms in the fixed-effects slot — i.e. we just need | am^vs
— because these fixed-effects terms are all swept out of the model simultaneously at estimation time. ↩
I gave such a talk on Tuesday, covering my research on “Big Data and Its Impact on Global Fisheries”. Given the informal setting, I tried to keep it somewhat irreverent and, I hope, entertaining. (The beers helped.) Thanks to everyone who came out!
You can view my slides here.
(Hit F11 to go fullscreen and “p” to see my speaker notes.)
Listen to “The Blue Paradox” on Spreaker.
Can you actually make a problem worse by promising to solve it?
That’s a conundrum that policymakers face — often unwittingly — on a variety of issues. A famous, if controversial, example comes from the gun control debate in the United States, where calls for tougher legislation in the wake of the 2012 Sandy Hook school massacre were followed by a surge in firearm sales. The reason? Gun enthusiasts tried to stockpile firearms before it became harder to purchase them.
In a new paper published in PNAS, we ask whether the same thing can happen with environmental conservation.
The short answer is “yes”. Using data from Global Fishing Watch (GFW), we show that efforts to ban fishing in a large, ecologically sensitive part of the ocean paradoxically led to more fishing before the ban could be enforced.
We focus on the Phoenix Islands Protected Area (PIPA), a California-sized swathe of ocean in the central Pacific, known for its remarkable and diverse marine ecosystem. Fishing in PIPA has been banned since January 1, 2015, when it was established as one of the world’s largest marine reserves. The success in enforcing this ban has been widely celebrated by conservationists and scientists alike. Indeed, demonstrating this conclusively helped to launch GFW in the first place.
However, it turns out that the story is more complicated than that. We show that there was a dramatic spike in fishing effort in the period leading up to the ban, as fishermen preemptively clamored to harvest resources while they still could. Here’s the key figure from the paper:
Focus on the red and blue lines in the top panel. The red line shows fishing effort in PIPA. The blue line shows fishing effort over a control region that serves as our counterfactual (i.e. it is very similar to PIPA but no ban was ever implemented there). The dashed vertical line shows the date when the fishing ban was enforced, on January 1, 2015. The earlier solid vertical line shows the earliest mention of an eventual PIPA ban that we could find in the news media, on September 1, 2013.
Notice that fishing effort in the two regions is almost identical up until that first news coverage, which is reassuring in terms of the validity of our control region. But then notice the dramatic increase in fishing over PIPA from September 1, 2013 to January 1, 2015, relative to the control group. You can see that difference (and its statistical significance) more clearly in the bottom panel. The area under that purple line represents extra fishing equivalent to 1.5 years of avoided fishing after the ban.
In summary, anticipation of the fishing ban perversely led to more fishing, undermining the very conservation goal that was being sought and likely placing PIPA in a relatively impoverished state before the policy could be enforced. We call this phenomenon the “blue paradox”.
Alongside our headline finding, there are several other things that we think are noteworthy about the paper:
There’s more that we could say, much of which you can find in the paper itself. Please note that our intention is not to denigrate marine protected areas (MPAs) as a potentially valuable conservation tool, much less claim that PIPA was not worth it. (Far from it!) Rather, our goal is to spark a wider conversation about the tradeoffs involved in designing environmental policies, and the role that new data sources can play in informing those tradeoffs. As we conclude in the paper:
We end on a hopeful note, recognizing that the evidence presented herein would have been impossible only a few years ago due to data limitations. Thanks to the advent of incredibly rich satellite data provided by the likes of GFW, we now have the means to address previously unanswered questions and improve management of our natural resources accordingly.
Note: “The blue paradox: Preemptive overfishing in marine reserves” (PNAS, 2018) is joint work between ourselves, Gavin McDonald and Chris Costello. All of the code and data used in the paper are available at https://github.com/grantmcdermott/blueparadox.
PS — Some nice media coverage of our paper in The Atlantic, Oceana, Phys/UO, Science Daily/UCSB, and the PNAS blog. In addition, here are some radio interviews that I’ve done on the paper.