1. Better on-disk storage
Columnar storage formats (e.g. Parquet) allow better compression (much smaller footprint) and efficient random access to selected rows or columns, so you don't have to read the whole dataset à la CSVs.
2. Better in-memory representation
Standardisation around the Apache Arrow format + columnar representation. (Allows zero copy, fewer cache misses, etc.)
OLAP + lazy materialisation. (Rather than “eagerly” executing each query step, we can be “lazy” and optimise the whole query before executing it.)
Query optimization
Key concepts
Three key optimizations work together:
Lazy Materialization: When computation happens
Predicate Pushdown: Where filtering happens
Projection Pushdown: Which columns are read
Lazy Materialization
When computation happens
Build query plan first, execute later
Only compute when results are needed (.collect(), .show())
Enables global optimization across entire pipeline
Example: SELECT operations are bumped to the top of an (optimized) query plan to avoid unnecessary work (see the sketch below)
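As a rough Polars sketch of what lazy materialization looks like in practice (the Parquet path and column names below are hypothetical placeholders, not from the notebooks):

import polars as pl

# Nothing is read or computed here: these calls only build a query plan
# ("nyc-taxi/*.parquet" and the column names are placeholder examples)
q = (
    pl.scan_parquet("nyc-taxi/*.parquet")
      .filter(pl.col("month") == 3)
      .select(["passenger_count", "tip_amount"])
      .group_by("passenger_count")
      .agg(pl.col("tip_amount").mean().alias("avg_tip"))
)

print(q.explain())   # inspect the optimized plan before anything runs
df = q.collect()     # only now does Polars actually touch the data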
Predicate Pushdown
Which rows are read
Push WHERE conditions to the storage layer
Filter at file/partition level before reading into memory
Dramatically reduces I/O by skipping irrelevant data
Example: WHERE month = 3 → only scan the March Parquet files
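A hedged DuckDB sketch of the same idea from Python (the hive-partitioned path and column names are assumptions for illustration):

import duckdb

con = duckdb.connect()

# The WHERE clause is pushed into the Parquet scan itself, so with
# hive-partitioned files (e.g. .../month=3/part-0.parquet) only the
# matching files and row groups ever get read from disk
con.sql("""
    EXPLAIN
    SELECT AVG(tip_amount)
    FROM read_parquet('nyc-taxi/*/*.parquet', hive_partitioning = true)
    WHERE month = 3
""").show()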
Projection Pushdown
Which columns are read
Only read columns actually needed for the query
Skip unnecessary columns at the storage layer
Reduces memory usage and I/O bandwidth
Example: SELECT tip_amount → only read the tip_amount column, skip the other 29 columns
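Roughly, in Polars terms (again with a placeholder path and column name):

import polars as pl

# Only tip_amount will be read from the Parquet files; every other
# column is skipped at the scan, cutting I/O and memory use
q = pl.scan_parquet("nyc-taxi/*.parquet").select(pl.col("tip_amount").sum())

print(q.explain())   # the plan shows the scan restricted to tip_amount
print(q.collect())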
Moreover—and I think this is key—these kinds of benchmarks normally exclude the data I/O component… and the associated benefits of not having to hold the whole dataset in RAM.
There are some fantastically fast in-memory data wrangling libraries out there. (My personal faves: data.table and collapse.) But “in-memory” means that you always have to keep the full dataset in, well, memory. And this can be expensive.
Libraries like DuckDB and Polars sidestep this problem, effectively supercharging your computer’s data wrangling powers.
DuckDB
Embedded C++ analytical database (no server needed)
Multiple language frontends (R, Python, Julia, etc.)
SQL interface with “friendly” extensions
Excellent for out-of-memory operations
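To give a feel for this from Python, here is a minimal sketch (the toy table and column names are made up):

import duckdb

con = duckdb.connect()   # in-process and in-memory by default; no server to run

# A toy table, purely for illustration
con.sql("CREATE TABLE trips AS SELECT range AS id, range * 1.5 AS fare FROM range(10)")

# A couple of DuckDB's "friendly" SQL extensions:
con.sql("SUMMARIZE trips").show()        # quick summary statistics for every column
con.sql("""
    SELECT id % 2 AS parity, AVG(fare) AS avg_fare
    FROM trips
    GROUP BY ALL    -- no need to repeat the non-aggregated columns
""").show()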
Polars
Embedded Rust-based DataFrame library
Python/R bindings (multiple language support)
DataFrame interface with lazy evaluation
Built on Apache Arrow memory format
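And a correspondingly minimal Polars sketch (toy data, purely illustrative):

import polars as pl

# A small in-memory DataFrame
df = pl.DataFrame({"passenger_count": [1, 2, 1], "tip_amount": [2.0, 3.5, 1.0]})

# Familiar DataFrame verbs; .lazy() defers execution until .collect()
out = (
    df.lazy()
      .group_by("passenger_count")
      .agg(pl.col("tip_amount").mean().alias("avg_tip"))
      .collect()
)

# Because the data live in Arrow memory, handing them to other
# Arrow-aware tools is cheap (often zero-copy)
arrow_table = out.to_arrow()
print(out)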
Examples
Live coding sessions
Let’s head back to the website to work through some notebooks.
Streaming is the feature that enables working with bigger-than-RAM data.
Very easy to use and/or adjust our workflow to these cases…
DuckDB: Simply specify a disk-backed database when you first fire up your connection from Python or R, e.g.
library(duckdb)  ## duckdb attaches DBI, so dbConnect() is available
con = dbConnect(duckdb(), dbdir = "nyc.dbb")
Polars: Simply specify streaming when collecting, e.g.
some_query.collect(streaming=True)
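For reference, the Python equivalents look something like this (a sketch; the nyc.dbb filename and the Parquet path are placeholders that mirror the examples above):

import duckdb
import polars as pl

# DuckDB from Python: pointing the connection at a file (rather than the
# default in-memory database) gives a disk-backed database that can spill
# beyond RAM
con = duckdb.connect("nyc.dbb")

# Polars: build a lazy query against (placeholder) Parquet files and collect
# it with the streaming engine so it is processed in batches
some_query = (
    pl.scan_parquet("nyc-taxi/*.parquet")
      .group_by("passenger_count")
      .agg(pl.col("tip_amount").mean())
)
result = some_query.collect(streaming=True)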
What didn’t we cover?
Other cool features (cont.)
Modeling
The modeling part used to be less tightly integrated with DuckDB/Polars workflows. But that’s changing rapidly, e.g.: dbreg, duckreg, and duckdb-mlpack.
FWIW, being able to quickly read (and write) just parts of large datasets makes it very easy to iteratively run analyses on subsets of your data. E.g., I often pair them with fixest for exceptional in-memory performance.
You can also run bespoke models via UDFs and/or predictions on database backends. [Ex. 1, 2, 3]
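As a hedged sketch of the UDF route via DuckDB's Python API (the "model" and column are invented purely for illustration):

import duckdb
from duckdb.typing import DOUBLE

con = duckdb.connect()

# A stand-in "model": any plain Python function can be registered as a UDF
def predict_tip(fare: float) -> float:
    return 0.5 + 0.15 * fare   # hypothetical fitted coefficients

con.create_function("predict_tip", predict_tip, [DOUBLE], DOUBLE)

# The registered UDF can now be called from ordinary SQL
print(con.sql("SELECT predict_tip(10.0) AS predicted_tip"))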
Resources
Learning more
DuckDB
DuckDB homepage. Includes a very informative blog and standalone documentation for the client APIs (Python, R, and many others).
Also check out Harlequin for a cool, shell-based DuckDB IDE.
Polars
Polars GitHub Repo. Contains links to the standalone documentation for the client APIs (Python, R, etc.).