Simulations remix: Turn up the base
Note: I apologise for the title of this post and will show myself the door shortly.
I wanted to quickly follow up on my last post about efficient simulations in R. If you recall, in that post we used data.table and some other tricks to run 40,000 regressions (i.e. 20k simulations with 2 regressions each) in just over 2 seconds. The question before us today is: Can we go even faster using only base R? And it turns out that the answer is, yes, we can.
My motivation for a followup is partially the result of this very nice post by Benjamin Elbers, who replicates my simulation using Julia. In so doing, he demonstrates some of Julia’s killer features; most notably the fact that we don’t need to think about vectorisation — e.g. when creating the data — since Julia’s compiler will take care of that for us automatically.^{1} But Ben also does another interesting thing, which is to show the speed gains that come from defining our own (super lean) regression function. He uses Cholesky decomposition and it’s fairly straightforward to do the same thing in R. (Here is a nice tutorial by Issac Lee.)
I was halfway on my way to doing this myself when I stumbled on a totally different post by R Core member, Martin Maechler. Therein he introduces the .lm.fit()
function (note the leading dot), which incurs even less overhead than the lm.fit()
function I mentioned in my last post. I’m slightly embarrassed to say I had never heard about it until now^{2}, but a quick bit of testing “confirms” Martin’s more rigorous benchmarks: .lm.fit
yields a consistent 3040% improvement over even lm.fit
.
Now, it would be trivial to amend my previous simulation script to slot in .lm.fit()
and rerun the benchmarks. But I thought I’d make this a bit more interesting by redoing the whole thing using only base R. (I’ll load the parallel package, but that comes bundled with the base distribution so hardly counts as cheating.) Here’s the full script with benchmarks for both sequential and parallel implementations at the bottom.
There you have it. Down to less than a second for a simulation involving 40,000 regressions using only base R.^{3} On a laptop, no less. Just incredibly impressive.
Conclusion: No grand conclusion today… except a sincere note of gratitude to the R Core team (and Julia devs and so many other OSS maintainers) for providing us with such an incredible base to build from.
P.S. Achim Zeileis (who else?) has another great tip for speeding up simulations where the experimental design is fixed here.

In this house, we stan both R and Julia. ↩

I think Dirk Eddelbuettel had mentioned it to me, but I hadn’t grokked the difference. ↩

Interestingly enough, this knitted R markdown version is a bit slower than when I run the script directly in my R console. But we’re really splitting hairs now. (As an aside: I won’t bother plotting the results, but you’re welcome to run the simulation yourself and confirm that it yields the same insights as my previous post.) ↩
Comments