Requirements
In order to follow along with the examples in this workshop, you’ll need to install some R and/or Python packages, as well as download a reasonably large dataset. Please make sure that you have completed all requirements before the workshop starts!
My working assumption is that you have already cloned the companion GitHub repo to this website. If not, please do so before continuing and navigate to the root of your local clone, using the following commands.
git clone https://github.com/grantmcdermott/duckdb-polars.git
cd duckdb-polars
(Don’t worry if you can’t clone the repo for some reason. You may need to adjust some relative file paths when you are calling the actual code chunks later in the workshop, but we’ll figure it out.)
R and Python Packages
For this workshop, you have the option of following along in either R, Python, or both. Ideally, I’d recommend both since one of my goals is to demonstrate the close equivalency in workflows across languages. But I’ll leave that to you.1
Run the following commands in your R console.
install.packages(c("duckdb", "arrow", "dplyr", "tidyr", "duckplyr"))
polars (and therefore tidypolars) are not on CRAN so we install them from R-universe. Details here.
Sys.setenv(NOT_CRAN = "true")
install.packages(c("polars", "tidypolars"), repos = "https://community.r-multiverse.org")
Note that you will need polars
>= 0.19.1 and tidypolars
>= 0.10.1.
Are you an R user on a Linux machine? If so, I strongly recommend that you configure your user profile to pull in pre-compiled R package binaries for your distro from PPM, rather than installing source packages from CRAN (and then having to compile them on your own machine). This will greatly reduce installation times and other potential install headaches. If you haven’t done this already, or don’t know what I’m talking, then the simplest thing to do is to let the excellent rspm package (link) figure it out for you. Bonus: It will also resolve system dependencies at the same time.
# Run these two commands before installing any other packages
install.packages("rspm")
::enable() rspm
P.S. Once you have installed the rspm package, you can add the following line to your ~/.Rprofile
file and it will automatically figure everything out for you whenever you start a new R session. See the rspm website for additional tips around integration with renv projects and so on.
suppressMessages(rspm::enable())
First create and activate a Python virtual environment from your terminal. (Important: I’ll assume that you are in the current root of this repo.) The exact command varies by operating system.
# MacOS / Linux
python3 -m venv .venv
source .venv/bin/activate
# Windows
py -m venv .venv
.venv\Scripts\activate.bat
Then install the Python packages that we will be using.
python3 -m pip install duckdb polars pyarrow pandas matplotlib -U
python3 -m pip install 'ibis-framework[duckdb,polars]' -U
If you are using VS Code, then there are a few tweaks to this Python setup. First up, once you’ve create your .venv
virtual environment, then should see a pop-up message to the effect of:
We noticed a new environment has been created. Do you want to select it for the
workspace folder?
Select “Yes”, then choose your Python interpreter (ideally Python 3.9 or higher).
Once that’s done, you will also need to install the ipykernel
and jupyter
packages in addition to the packages that I mentioned above. Moreover, I recommend install packages from within VS Code using cell magics, i.e.
```{py}
%pip install ipykernel jupyter -U
%pip install duckdb polars pyarrow pandas -U
%pip install 'ibis-framework[duckdb,polars]' -U
```
NYC taxi data
For this workshop, we’ll make use of the infamous well-known New York City taxi data.
- We’ll just be downloading a single year’s worth of data from 2012. But that will be enough to demonstrate the point and it’s of comparable size to the “typical” dataset that I work with.
- The final dataset is ~8.5 GB compressed on disk and can take 10-20 minutes to download, depending on your internet connection.
You can download the dataset with the below terminal commands.
You will need the aws cli
tool (install link) for these next commands to work. This should be a quick and simple install (you do not need a AWS account), but see further below for some alternative download options.
mkdir -p nyc-taxi/year=2012
aws s3 cp s3://voltrondata-labs-datasets/nyc-taxi/year=2012 nyc-taxi/year=2012 --recursive --no-sign-request
Besides being relatively chonky, there are two features of this dataset that we’ll come back to since they are key to our workflow:
- The data are stored in
.parquet
file format. - These Parquet files are organised in so-called “Hive-style” partitions on disk.
Other data options
Smaller subsets of the data
If you’re pressed for time and/or disk space, feel free to only grab a subset of the data manually. But make sure that you preserve the Hive-style partitioning. Here’s a quick example of how to do it for the first two months.
mkdir -p nyc-taxi/year=2012/month=1
mkdir -p nyc-taxi/year=2012/month=2
aws s3 cp s3://voltrondata-labs-datasets/nyc-taxi/year=2012/month=1/ nyc-taxi/year=2012/month=1 --recursive --no-sign-request
aws s3 cp s3://voltrondata-labs-datasets/nyc-taxi/year=2012/month=2/ nyc-taxi/year=2012/month=2 --recursive --no-sign-request
Alternative download options
If you don’t have the aws cli
tool, or can’t install install it for some reason, then you can always download the dataset directly from R or Python using some of the packages that we installed above. For example:
library(arrow)
library(dplyr)
= "nyc-taxi/year=2012" # Or set your own preferred path
data_path
open_dataset("s3://voltrondata-labs-datasets/nyc-taxi/year=2012") |>
write_dataset(data_path, partitioning = "month")
Be forewarned that these alternative download approaches are going to be slower than the aws cli
approach.
Footnotes
If you’re unsure and just want to pick one, then I recommend R. It’s much easier to install and manage environments. Plus it’s also my preferred language, so you’re likely to get better support from me.↩︎