We just read a ~180 million row dataset (from disk!) and did a group-by aggregation on it.
In < 1 second.
On a laptop.
🤯
Let’s do a quick horserace comparison (similar grouped aggregation, but on a slightly smaller dataset)…
Simple benchmark: Computation time only
DuckDB and Polars are already plenty fast…
Simple benchmark: Computation time + data I/O
… but are even more impressive once we account for data import times
Wait. How??
Better disk storage 🤝 Better memory representation
Two coinciding (r)evolutions enable faster, smarter computation:
1. Better on-disk storage
Mainly talking about the Parquet file format here.
Columnar storage format allows much better compression (far smaller disk footprint) and efficient random access to selected rows or columns, so you don’t have to read the whole dataset a la CSVs. (See the code sketch after this list.)
2. Better in-memory representation
Standardisation around the Apache Arrow format + columnar representation. (Allows zero copy, fewer cache misses, etc.)
OLAP + deferred materialisation. (Rather than “eagerly” executing each query step, we can be “lazy” and optimise queries before executing them.)
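To make this concrete, here is a minimal Python sketch with Polars (the taxi.parquet filename and the passenger_count / tip_amount columns are hypothetical stand-ins): because the Parquet file is columnar and the query is lazy, only the two columns we actually touch get read, and the whole pipeline is optimised before anything is executed.

import polars as pl

# Lazy scan: nothing is read from disk yet; we just build a query plan.
# (taxi.parquet and the column names below are placeholder examples.)
query = (
    pl.scan_parquet("taxi.parquet")
      .group_by("passenger_count")
      .agg(pl.col("tip_amount").mean().alias("mean_tip"))
)

# Inspect the optimised plan: the projection down to the two columns we
# actually need is pushed into the Parquet reader itself.
print(query.explain())

# Only now does Polars touch the file, reading just those columns.
result = query.collect()
print(result)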
Moreover—and I think this is key—these kinds of benchmarks normally exclude the data I/O component… and the associated benefits of not having to hold the whole dataset in RAM.
There are some fantastically fast in-memory data wrangling libraries out there. (My personal faves: data.table and collapse.) But “in-memory” means that you always have to keep the full dataset in, well, memory. And this can be expensive.
Libraries like DuckDB and Polars sidestep this problem, effectively supercharging your computer’s data wrangling powers.
Examples
Live coding sessions
Let’s head back to the website to work through some notebooks.
Streaming is the feature that enables working with bigger-than-RAM data.
Very easy to use and/or adjust our workflow to these cases…
DuckDB: Simply specify a disk-backed database when you first fire up your connection from Python or R, e.g.
con = dbConnect(duckdb(), dbdir = "nyc.dbb")
Polars: Simply specify streaming when collecting, e.g.
some_query.collect(streaming=True)
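Putting the two together in a short Python sketch (the duckdb and polars imports, the reuse of the nyc.dbb filename from above, and the taxi.parquet / column names are illustrative assumptions, not a fixed recipe):

import duckdb
import polars as pl

# DuckDB: passing a file path instead of ":memory:" gives a disk-backed
# database, so working data can spill to disk rather than living in RAM.
con = duckdb.connect(database="nyc.dbb")

# Polars: build a lazy query, then request streaming at collect time so the
# data is processed in chunks instead of being loaded all at once.
some_query = (
    pl.scan_parquet("taxi.parquet")
      .group_by("passenger_count")
      .agg(pl.col("tip_amount").sum())
)
result = some_query.collect(streaming=True)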
What didn’t we cover?
Other cool features (cont.)
Modeling
The modeling part of this workflow is less tightly integrated b/c we generally have to bring the data into RAM.
But being able to quickly read in and write out parts of large datasets makes it very easy to iteratively run analyses on subsets of your data (see the short sketch below). E.g., I typically pair these tools with fixest for unmatched performance on high-dimensional data.
You can also run bespoke models via UDFs and/or predictions on database backends. [Ex. 1, 2, 3]
FWIW I believe that the underlying matrix and linear algebra libraries for direct modeling with these tools are coming. [Ex. 1, 2]
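To illustrate the subset-then-model workflow mentioned above, a hedged Python sketch (the trips table, the column names, and the plain NumPy least-squares fit standing in for fixest or your modeling library of choice are all assumptions): pull just the rows and columns you need into RAM, then fit the model on that subset.

import duckdb
import numpy as np

con = duckdb.connect(database="nyc.dbb")

# Pull only the subset (rows and columns) we actually want to model into RAM.
# .df() materialises the result as a pandas DataFrame (requires pandas).
df = con.execute("""
    SELECT tip_amount, trip_distance
    FROM trips
    WHERE passenger_count = 1
""").df()

# Fit a simple model on the in-memory subset; swap in your preferred library.
X = np.column_stack([np.ones(len(df)), df["trip_distance"].to_numpy()])
beta, *_ = np.linalg.lstsq(X, df["tip_amount"].to_numpy(), rcond=None)
print(beta)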
Resources
Learning more
DuckDB
DuckDB homepage. Includes a very informative blog and standalone documentation for the client APIs (Python, R, and many others).
Also check out Harlequin for a cool, shell-based DuckDB IDE.
Polars
Polars GitHub repo. Contains links to the standalone documentation for the client APIs (Python, R, etc.).