DuckDB SQL

Use the same queries from R, Python, Julia…

Load libraries

library(duckdb)
Loading required package: DBI
import duckdb
import time # just for timing some queries

Create a database connection

The first thing we need to do is instantiate a connection with an in-memory database.1

con = dbConnect(duckdb(), shutdown = TRUE)
con = duckdb.connect(database = ':memory:', read_only = False)

Digression: In-memory versus on-disk

The fact that our connection lives “in memory” is a killer feature of DuckDB (one that it inherits from SQLite). We don’t need to connect to some complicated, existing database infrastructure to harness all of DuckDB’s power. Instead we can just spin up an ephemeral database that interacts directly with our R (or Python, or Julia, etc.) client.

However, it’s worth noting that you can create a persistent, disk-backed database simply by providing a database file path argument as part of your connection, e.g.

## Uncomment and run the next line if you'd like to create a persistent,
## disk-backed database instead.

# con = dbConnect(duckdb(), dbdir = "nyc.duck")
## Uncomment and run the next line if you'd like to create a persistent,
## disk-backed database instead.

# con = duckdb.connect(database = 'nyc.duck', read_only = False)

(Note that the ".duck" file extension above is optional. You could also use ".db", ".dbb", or whatever you want really.)

Bigger than RAM data?

One really important benefit of creating a persistent, disk-backed database is that it enables out-of-core computation for bigger-than-RAM data. See here for more details and performance considerations (which remain very good even in the out-of-core case).
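For instance (a hedged sketch with purely illustrative values), DuckDB exposes memory_limit and temp_directory settings that let you cap its RAM usage and point any spill-over at a scratch directory on disk:

## Illustrative values only: cap DuckDB's working memory and give it a
## scratch directory to spill to when operating out-of-core.

# con.execute("SET memory_limit = '4GB'")
# con.execute("SET temp_directory = '/tmp/duckdb_spill'")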

First example

We’ll start with a simple aggregation query to get situated. I’ll also use this example to highlight some general features of DuckDB SQL and the underlying query engine.

Okay, first query. Let’s say we want to know: What is the average tip per passenger count? A typical SQL job for this question might look as follows:

SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
FROM 'nyc-taxi/**/*.parquet'
GROUP BY passenger_count
ORDER BY passenger_count

(The final ORDER BY statement is optional. Note that ordering (i.e., sorting) is a potentially expensive operation, but we'll come back to that later.)

This is perfectly valid DuckDB SQL too. However, we can rewrite it with slightly nicer syntax thanks to DuckDB’s “friendly SQL”. The key changes for this simple query are: (1) the FROM statement comes first, and (2) we can use the GROUP BY ALL and ORDER BY ALL statements to avoid writing out the “passenger_count” grouping column multiple times.2

FROM 'nyc-taxi/**/*.parquet'
SELECT
  passenger_count,
  AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
DuckDB’s “friendly SQL”

One of the under-appreciated (IMHO) features of DuckDB is that it supports many syntax enhancements over traditional SQL dialects, which the developers collectively dub “friendly SQL”. Together these syntax enhancements allow you to write much more ergonomic SQL queries that cut down on duplication and logical inconsistencies.
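To give a quick, illustrative taste of another friendly SQL convenience: the COLUMNS() expression lets you select (and even aggregate) columns by regular expression. A hedged sketch, relying only on the *_amount columns that we know exist in our dataset:

con.query(
  '''
  FROM 'nyc-taxi/**/*.parquet'
  SELECT MAX(COLUMNS('.*_amount'))  -- apply the aggregate to every matching column
  '''
)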

To run this operation from our R or Python client, simply pass the SQL query as a string to our connection. Let’s use this as a chance to save the result and time our query too.

tic = Sys.time()
dat1 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
toc = Sys.time()

dat1
   passenger_count  mean_tip
1                0 0.8620988
2                1 1.1510110
3                2 1.0815798
4                3 0.9629494
5                4 0.8445190
6                5 1.1027325
7                6 1.1283649
8                7 0.5441176
9                8 0.3507692
10               9 0.8068000
11              10 0.0000000
12              65 0.0000000
13              66 1.5000000
14             177 1.0000000
15             208 0.0000000
16             247 2.3000000
17             249 0.0000000
18             254 0.0000000
toc - tic
Time difference of 1.211972 secs
tic = time.time()
dat1 = (
  con.
  query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      AVG(tip_amount) AS mean_tip
    GROUP BY ALL
    ORDER BY ALL
    '''
    )
)
toc = time.time()

dat1
┌─────────────────┬─────────────────────┐
│ passenger_count │      mean_tip       │
│      int64      │       double        │
├─────────────────┼─────────────────────┤
│               0 │  0.8620988141424404 │
│               1 │  1.1510109615454076 │
│               2 │  1.0815798424001326 │
│               3 │  0.9629493657892962 │
│               4 │  0.8445189789660359 │
│               5 │   1.102732453261797 │
│               6 │  1.1283649236954338 │
│               7 │  0.5441176470588235 │
│               8 │ 0.35076923076923083 │
│               9 │  0.8068000000000001 │
│              10 │                 0.0 │
│              65 │                 0.0 │
│              66 │                 1.5 │
│             177 │                 1.0 │
│             208 │                 0.0 │
│             247 │                 2.3 │
│             249 │                 0.0 │
│             254 │                 0.0 │
├─────────────────┴─────────────────────┤
│ 18 rows                     2 columns │
└───────────────────────────────────────┘
# print(f"Time difference of {toc - tic} seconds")
## Timing will be misleading for this rendered Quarto doc, since we're calling
## Python from R (via the reticulate package).

Note that we actually get a polars DataFrame as a return object. Click the callout box below to learn more.

By default, the con.query method that we are using here will return a polars DataFrame object that Python understands “natively” (i.e., has a print method for and so on). Behind the scenes, this duckdb to polars integration relies on the pyarrow library being available to our Python environment, which we have already installed for this workshop.

It’s also possible to return other types of Python objects. For example, you can use the .df() method to coerce to a pandas DataFrame instead, among several other formats like numpy arrays. (Details here.) Given the focus of this workshop, it won’t surprise you to hear that I’m going to stick with the default polars conversion.
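For example, here's a quick sketch of coercing the query result to other formats (.df() and .arrow() are standard duckdb conversion methods, though exact availability may depend on your duckdb version):

dat1_pandas = dat1.df()     # coerce to a pandas DataFrame
dat1_arrow  = dat1.arrow()  # coerce to a pyarrow Table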

So that only took 1.21 seconds in this rendered Quarto doc (and will likely be even faster when you try in an interactive session). To underscore just how crazy impressive this is, recall that this includes the time that it takes to read the data from disk. I can almost guarantee that the read + serialization time alone for traditional data wrangling workflows would take several minutes, and probably crash my laptop RAM. Don’t forget that our full dataset is nearly 200 million rows deep and 30 columns wide.

Aside: We clearly have a few outlier typos in our dataset. 254 passengers in a single taxi trip? I don’t think so. We’d probably want to filter these out with a WHERE statement if we were doing serious analysis, but I’m just going to leave them in for this tutorial.
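If we did want to drop them, a sketch of the corresponding filter might look as follows (the cutoff of 9 passengers is arbitrary, purely for illustration):

con.query(
  '''
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE passenger_count BETWEEN 1 AND 9  -- drop the obvious outlier typos
  GROUP BY ALL
  ORDER BY ALL
  '''
)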

Aggregation

Let’s try out some more aggregation queries. How about a slight variation on our first example query, where we (a) add “month” as a second grouping variable, and (b) subset to only the first three months of the year.

tic = Sys.time()
dat2 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    month,
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE month <= 3
  GROUP BY ALL
  "
    )
toc = Sys.time()

head(dat2)
  month passenger_count mean_tip
1     1               1 1.036863
2     2               1 1.068490
3     3               1 1.089205
4     2               8 0.500000
5     1               8 0.000000
6     1               7 0.000000
toc - tic
Time difference of 0.373965 secs
tic = time.time()
dat2 = (
  con.
  query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      month,
      passenger_count,
      AVG(tip_amount) AS mean_tip
    WHERE month <= 3
    GROUP BY ALL
    '''
  )
)
toc = time.time()

dat2
┌───────┬─────────────────┬────────────────────┐
│ month │ passenger_count │      mean_tip      │
│ int64 │      int64      │       double       │
├───────┼─────────────────┼────────────────────┤
│     1 │               8 │                0.0 │
│     2 │               8 │                0.5 │
│     1 │               7 │                0.0 │
│     1 │               3 │ 0.8752659692390742 │
│     1 │               6 │ 1.0175694433120148 │
│     2 │               3 │ 0.8948976704752726 │
│     2 │               6 │ 1.0218459360559857 │
│     3 │               6 │ 1.0515659390082825 │
│     3 │               3 │ 0.9121818858010082 │
│     1 │               1 │ 1.0368628142310818 │
│     · │               · │          ·         │
│     · │               · │          ·         │
│     · │               · │          ·         │
│     2 │               2 │ 0.9908003546925661 │
│     3 │               2 │ 1.0096468528132252 │
│     3 │             208 │                0.0 │
│     1 │             208 │                0.0 │
│     1 │               5 │  1.001198485873298 │
│     2 │               5 │ 1.0157674270380461 │
│     3 │               5 │ 1.0353911898062576 │
│     1 │              65 │                0.0 │
│     2 │               9 │                0.0 │
│     1 │               9 │                0.0 │
├───────┴─────────────────┴────────────────────┤
│ 29 rows (20 shown)                 3 columns │
└──────────────────────────────────────────────┘
# print(f"Time difference of {toc - tic} seconds")
## Timing will be misleading for this rendered Quarto doc, since we're calling
## Python from R (via the reticulate package).

Note that this query completed even faster than the first one, even though we added another grouping variable. Reason: Subsetting along our Hive-partitioned parquet dataset allows DuckDB to take shortcuts. We can see this directly by prepending an EXPLAIN statement to our query to reveal the optimized query plan.

dbGetQuery(
  con,
  "
  EXPLAIN
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      month,
      passenger_count,
      AVG(tip_amount) AS mean_tip
    WHERE month <= 3
    GROUP BY ALL
  "
)
physical_plan
┌───────────────────────────┐
│       HASH_GROUP_BY       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             #0            │
│             #1            │
│          avg(#2)          │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│         PROJECTION        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           month           │
│      passenger_count      │
│         tip_amount        │
└─────────────┬─────────────┘                             
┌─────────────┴─────────────┐
│       PARQUET_SCAN        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│           month           │
│      passenger_count      │
│         tip_amount        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ File Filters: (month <= 3)│
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        EC: 44907396       │
└───────────────────────────┘                             
con.query(
  '''
  EXPLAIN
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      month,
      passenger_count,
      AVG(tip_amount) AS mean_tip
    WHERE month <= 3
    GROUP BY ALL
  '''
)
┌───────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│  explain_key  │                                            explain_value                                             │
│    varchar    │                                               varchar                                                │
├───────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ physical_plan │ ┌───────────────────────────┐\n│       HASH_GROUP_BY       │\n│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │\n│    …  │
└───────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘

tl;dr DuckDB is able to exploit the month partition of our dataset, so subsetting means that it can avoid unnecessary data ingestion. Similarly, it only reads in a select group of columns; that’s what the “PROJECTION” part of the plan denotes. If nothing else, the take-home message is that DuckDB only does what it needs to. Laziness as a virtue!
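(Relatedly, if you want actual runtimes attached to each operator rather than just the plan, you can prepend EXPLAIN ANALYZE instead. Fair warning: unlike plain EXPLAIN, this executes the full query as part of the profiling. A sketch:)

con.query(
  '''
  EXPLAIN ANALYZE
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      month,
      passenger_count,
      AVG(tip_amount) AS mean_tip
    WHERE month <= 3
    GROUP BY ALL
  '''
)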

Here’s a final aggregation example, this time including a high-dimensional grouping column (i.e., “trip_distance”) and some additional aggregations.

tic = Sys.time()
dat3 = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    trip_distance,
    AVG(tip_amount) AS mean_tip,
    AVG(fare_amount) AS mean_fare
  GROUP BY ALL
"
)
toc = Sys.time()

nrow(dat3)
[1] 25569
head(dat3)
  passenger_count trip_distance  mean_tip mean_fare
1               1          3.80 1.4617821 13.624269
2               1          2.70 1.1625732 10.811488
3               1          2.20 1.0099496  9.471702
4               1          4.70 1.6794478 15.728522
5               2          1.86 0.8184262  8.482737
6               2          2.70 0.9988106 10.910493
toc - tic
Time difference of 2.870613 secs
tic = time.time()
dat3 = (
  con.
  query(
    '''
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
    '''
    )
)
toc = time.time()

len(dat3)
25569
dat3
┌─────────────────┬───────────────┬─────────────────────┬────────────────────┐
│ passenger_count │ trip_distance │      mean_tip       │     mean_fare      │
│      int64      │    double     │       double        │       double       │
├─────────────────┼───────────────┼─────────────────────┼────────────────────┤
│               1 │           2.7 │    1.16257323977722 │ 10.811487937005044 │
│               1 │           8.5 │  3.0037572144209994 │ 23.762548777891176 │
│               2 │           2.7 │  0.9988105937071349 │ 10.910492532335036 │
│               1 │           1.8 │   0.878870484195116 │  8.402046190257858 │
│               1 │           0.2 │  0.3864478915075284 │  4.671192577145984 │
│               1 │           5.9 │  1.9316437836686586 │   18.4181575160648 │
│               1 │           9.8 │   3.602063841563186 │  26.75012573394927 │
│               1 │           4.7 │   1.679447806444995 │ 15.728522049844514 │
│               1 │           2.2 │   1.009949551490608 │   9.47170239542306 │
│               1 │          0.93 │  0.5659718504077408 │ 5.8204754998691905 │
│               · │            ·  │           ·         │          ·         │
│               · │            ·  │           ·         │          ·         │
│               · │            ·  │           ·         │          ·         │
│               5 │           3.2 │   1.213120921023685 │ 12.059595218889795 │
│               2 │          4.58 │  1.5289010022859153 │  15.32215579391595 │
│               3 │          1.34 │  0.6439547586916009 │  6.937227684596117 │
│               5 │          0.69 │  0.4423211070274535 │  4.844665893786588 │
│               1 │         17.93 │   4.768273278630039 │ 48.079521940777724 │
│               6 │          0.49 │ 0.36281714343475185 │    4.2202365308804 │
│               6 │          1.53 │  0.7403351611114536 │  7.555714373729288 │
│               1 │         18.06 │   4.738210227272728 │ 48.059339488636354 │
│               5 │          13.6 │  3.7402173913043484 │  36.93664596273292 │
│               1 │         12.75 │   3.638386937159822 │  34.57689262741217 │
├─────────────────┴───────────────┴─────────────────────┴────────────────────┤
│ ? rows (>9999 rows, 20 shown)                                    4 columns │
└────────────────────────────────────────────────────────────────────────────┘
# print(f"Time difference of {toc - tic} seconds")
## Timing will be misleading for this rendered Quarto doc, since we're calling
## Python from R (via the reticulate package).

Pivot (reshape)

Let’s explore some pivot (reshape) examples, by building off the previous query.

  • UNPIVOT: wide => long
  • PIVOT: long => wide

Here I’ll use a Common Table Expression (CTE) to define a temporary table tmp_table, before unpivoting—i.e., reshaping long—at the end.

dat_long = dbGetQuery(
  con,
  "
  WITH tmp_table AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  )
  UNPIVOT tmp_table
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  "
)

head(dat_long)
  passenger_count trip_distance  variable     amount
1               1           1.1  mean_tip  0.6530588
2               1           1.1 mean_fare  6.3333552
3               2          18.0  mean_tip  3.7746336
4               2          18.0 mean_fare 48.1438661
5               1           1.2  mean_tip  0.6882080
6               1           1.2 mean_fare  6.6407479
dat_long = (
  con.
  query(
    '''
    WITH tmp_table AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        passenger_count,
        trip_distance,
        AVG(tip_amount) AS mean_tip,
        AVG(fare_amount) AS mean_fare
      GROUP BY ALL
    )
    UNPIVOT tmp_table
    ON mean_tip, mean_fare
    INTO
      NAME variable
      VALUE amount
    '''
  )
)

dat_long
┌─────────────────┬───────────────┬───────────┬────────────────────┐
│ passenger_count │ trip_distance │ variable  │       amount       │
│      int64      │    double     │  varchar  │       double       │
├─────────────────┼───────────────┼───────────┼────────────────────┤
│               1 │           0.9 │ mean_tip  │  0.578974821777743 │
│               1 │           0.9 │ mean_fare │  5.687595795787245 │
│               1 │          11.1 │ mean_tip  │  3.839358215996518 │
│               1 │          11.1 │ mean_fare │ 29.974972782697066 │
│               1 │           4.0 │ mean_tip  │ 1.5167455372913883 │
│               1 │           4.0 │ mean_fare │ 14.098943870659395 │
│               1 │          17.3 │ mean_tip  │  5.590380869171258 │
│               1 │          17.3 │ mean_fare │ 47.364542924635145 │
│               1 │          0.71 │ mean_tip  │ 0.4766168470138103 │
│               1 │          0.71 │ mean_fare │  5.104461268390179 │
│               · │            ·  │    ·      │           ·        │
│               · │            ·  │    ·      │           ·        │
│               · │            ·  │    ·      │           ·        │
│               5 │          12.7 │ mean_tip  │   3.07948347107438 │
│               5 │          12.7 │ mean_fare │  33.64896694214877 │
│               2 │         12.25 │ mean_tip  │ 3.1529553264604813 │
│               2 │         12.25 │ mean_fare │  32.42886597938145 │
│               6 │          9.22 │ mean_tip  │  3.173150684931507 │
│               6 │          9.22 │ mean_fare │ 25.668340943683408 │
│               6 │          9.34 │ mean_tip  │ 2.8852453987730065 │
│               6 │          9.34 │ mean_fare │  25.80444785276073 │
│               1 │          33.5 │ mean_tip  │  5.801387283236995 │
│               1 │          33.5 │ mean_fare │  56.67167630057804 │
├─────────────────┴───────────────┴───────────┴────────────────────┤
│ ? rows (>9999 rows, 20 shown)                          4 columns │
└──────────────────────────────────────────────────────────────────┘

Another option would have been to create a new table in memory and then pivot over that, which segues nicely to…

Digression: Create new tables

CTEs are a very common, and often efficient, way to implement multi-table operations in SQL. But, for the record, we can create new tables in DuckDB’s memory cache pretty easily using the CREATE TABLE statement.

Instead of DBI::dbGetQuery, we must now use DBI::dbExecute.

dbExecute(
    con,
    "
    CREATE TABLE taxi2 AS
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        passenger_count,
        trip_distance,
        AVG(tip_amount) AS mean_tip,
        AVG(fare_amount) AS mean_fare
      GROUP BY ALL
    "
)
[1] 25569
dbListTables(con)
[1] "taxi2"

FWIW, you can always remove a table with dbRemoveTable().

Instead of con.query(), we must now use con.execute().

con.execute(
  '''
  CREATE TABLE taxi2 AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      passenger_count,
      trip_distance,
      AVG(tip_amount) AS mean_tip,
      AVG(fare_amount) AS mean_fare
    GROUP BY ALL
  '''
)
<duckdb.duckdb.DuckDBPyConnection object at 0x1062d3cb0>
# https://stackoverflow.com/q/75727685
con.query(
  '''
  SELECT table_name, estimated_size AS nrows, column_count AS ncols
  FROM duckdb_tables;
  '''
)
┌────────────┬───────┬───────┐
│ table_name │ nrows │ ncols │
│  varchar   │ int64 │ int64 │
├────────────┼───────┼───────┤
│ taxi2      │ 25569 │     4 │
└────────────┴───────┴───────┘
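(The Python-side equivalent of dbRemoveTable() is a plain DROP TABLE statement. It's shown commented out here because we still need taxi2 below.)

# con.execute("DROP TABLE taxi2")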

Back to reshaping

With our new taxi2 table in hand, let’s redo the previous unpivot query directly on this new table. This makes the actual (un)pivot statement a bit clearer… and also separates out the execution time.

dbGetQuery(
  con,
  "
  UNPIVOT taxi2
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  LIMIT 5
  "
)
  passenger_count trip_distance  variable    amount
1               1           1.1  mean_tip 0.6530588
2               1           1.1 mean_fare 6.3333552
3               1           1.2  mean_tip 0.6882080
4               1           1.2 mean_fare 6.6407479
5               1           1.3  mean_tip 0.7207278
con.query(
  '''
  UNPIVOT taxi2
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  LIMIT 5
  '''
)
┌─────────────────┬───────────────┬───────────┬────────────────────┐
│ passenger_count │ trip_distance │ variable  │       amount       │
│      int64      │    double     │  varchar  │       double       │
├─────────────────┼───────────────┼───────────┼────────────────────┤
│               1 │           2.7 │ mean_tip  │   1.16257323977722 │
│               1 │           2.7 │ mean_fare │  10.81148793700505 │
│               0 │           9.1 │ mean_tip  │  2.399264497878359 │
│               0 │           9.1 │ mean_fare │ 23.355728429985852 │
│               1 │           1.8 │ mean_tip  │ 0.8788704841951157 │
└─────────────────┴───────────────┴───────────┴────────────────────┘

(Note how crazy fast pivoting in DuckDB actually is.)
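For completeness, here's a sketch of going back the other way (long => wide) with the PIVOT statement. It first materializes the long version of taxi2 as its own table; the taxi2_long name is just illustrative.

con.execute(
  '''
  CREATE TABLE taxi2_long AS
    SELECT * FROM (
      UNPIVOT taxi2
      ON mean_tip, mean_fare
      INTO
        NAME variable
        VALUE amount
    )
  '''
)
con.query(
  '''
  FROM (
    PIVOT taxi2_long
    ON variable
    USING FIRST(amount)
    GROUP BY passenger_count, trip_distance
  )
  LIMIT 5
  '''
)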

Joins (merges)

It’s a bit hard to demonstrate a join with only a single main table. But here is a contrived example, where we calculate the mean monthly tips and the mean monthly fares as separate sub-queries (CTEs), before joining them together by month.

dbGetQuery(
  con,
  "
  WITH 
    mean_tips AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        month,
        AVG(tip_amount) AS mean_tip
      GROUP BY month
    ),
    mean_fares AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        month,
        AVG(fare_amount) AS mean_fare
      GROUP BY month 
    )
  FROM mean_tips
  LEFT JOIN mean_fares
  USING (month)
  SELECT *
  ORDER BY mean_tips.month
  "
)
   month mean_tip mean_fare
1      1 1.007817  9.813488
2      2 1.036874  9.942640
3      3 1.056353 10.223107
4      4 1.043167 10.335490
5      5 1.078014 10.585157
6      6 1.091082 10.548651
7      7 1.059312 10.379943
8      8 1.079521 10.492650
9      9 1.254601 12.391198
10    10 1.281239 12.501252
11    11 1.250903 12.270138
12    12 1.237651 12.313953
con.query(
  '''
  WITH 
    mean_tips AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        month,
        AVG(tip_amount) AS mean_tip
      GROUP BY month
    ),
    mean_fares AS (
      FROM 'nyc-taxi/**/*.parquet'
      SELECT
        month,
        AVG(fare_amount) AS mean_fare
      GROUP BY month 
    )
  FROM mean_tips
  LEFT JOIN mean_fares
  USING (month)
  SELECT *
  ORDER BY mean_tips.month
  '''
)
┌───────┬────────────────────┬────────────────────┐
│ month │      mean_tip      │     mean_fare      │
│ int64 │       double       │       double       │
├───────┼────────────────────┼────────────────────┤
│     1 │ 1.0078165246989366 │  9.813487671828813 │
│     2 │ 1.0368737381553987 │  9.942640301300228 │
│     3 │   1.05635274287244 │ 10.223107216153554 │
│     4 │ 1.0431674901411208 │ 10.335489610549338 │
│     5 │ 1.0780143169836092 │ 10.585156844134143 │
│     6 │  1.091082009381275 │ 10.548651231531922 │
│     7 │  1.059312239456315 │ 10.379943069577804 │
│     8 │ 1.0795208991227114 │ 10.492650001890153 │
│     9 │ 1.2546008978994332 │ 12.391197540031698 │
│    10 │ 1.2812392796882088 │ 12.501252484194163 │
│    11 │ 1.2509031985269687 │ 12.270137514944446 │
│    12 │ 1.2376507362291407 │ 12.313952857613234 │
├───────┴────────────────────┴────────────────────┤
│ 12 rows                               3 columns │
└─────────────────────────────────────────────────┘
Challenge

Redo the above join but, rather than using CTEs, use tables that you first create in DuckDB’s memory bank. Again, this will simplify the actual join operation and also emphasise how crazy fast joins are in DuckDB.
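(Try it yourself before peeking. One possible Python sketch, mirroring the CREATE TABLE pattern from earlier:)

con.execute(
  '''
  CREATE TABLE mean_tips AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT month, AVG(tip_amount) AS mean_tip
    GROUP BY month
  '''
)
con.execute(
  '''
  CREATE TABLE mean_fares AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT month, AVG(fare_amount) AS mean_fare
    GROUP BY month
  '''
)
## The join itself now reduces to a few lines.
con.query(
  '''
  FROM mean_tips
  LEFT JOIN mean_fares USING (month)
  SELECT *
  ORDER BY month
  '''
)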

Windowing

One last example: Binning “trip_distance” into deciles and then calculating the mean “tip_amount” within each decile. This is an example of a window function, and a query pattern that I use all the time in my own work. I find it extremely useful for quickly pulling out descriptive patterns from large datasets, from which I can then develop a better intuition of my data. In turn, this shapes the hypotheses and modeling choices that I make in the subsequent analysis stage.

Sorting and sampling

I’m using a 1% random sample of my data here (see the USING SAMPLE 1% statement). Why? Because calculating deciles requires ranking your data, and this is expensive! To rank data, we first have to sort it (ORDER BY), and that requires evaluating/comparing every single row in your dataset. In turn, this means that it’s very hard to take shortcuts. (This is one reason why DuckDB’s optimized query plan will always delay sorting until as late as possible: so that it only has to sort a smaller subset/aggregation of the data if possible.) FWIW, DuckDB’s sorting algorithm is still crazy fast. But for data of this size, where sorting on the full dataset is unavoidable, I strongly recommend sampling first. Your general insights will almost certainly remain intact.

dbGetQuery(
  con,
  "
  WITH trip_deciles AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      tip_amount,
      trip_distance,
      NTILE(10) OVER (ORDER BY trip_distance) AS decile
    USING SAMPLE 1%
  )
  FROM trip_deciles
  SELECT
    decile,
    AVG(trip_distance) AS mean_distance,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  "
)
   decile mean_distance  mean_tip
1       1     0.4485001 0.5815993
2       2     0.8077842 0.5140569
3       3     1.0636530 0.5985021
4       4     1.3244716 0.6870273
5       5     1.6286108 0.7805147
6       6     1.9959871 0.8992282
7       7     2.4957121 1.0367072
8       8     3.2551231 1.2510039
9       9     4.7643520 1.6101116
10     10    11.0504331 3.2807728
con.query(
  '''
  WITH trip_deciles AS (
    FROM 'nyc-taxi/**/*.parquet'
    SELECT
      tip_amount,
      trip_distance,
      NTILE(10) OVER (ORDER BY trip_distance) AS decile
    USING SAMPLE 1%
  )
  FROM trip_deciles
  SELECT
    decile,
    AVG(trip_distance) AS mean_distance,
    AVG(tip_amount) AS mean_tip
  GROUP BY ALL
  ORDER BY ALL
  '''
)
┌────────┬─────────────────────┬────────────────────┐
│ decile │    mean_distance    │      mean_tip      │
│ int64  │       double        │       double       │
├────────┼─────────────────────┼────────────────────┤
│      1 │ 0.44922310548259337 │ 0.5799620572865318 │
│      2 │  0.8087209464724292 │ 0.5141380253494722 │
│      3 │  1.0653553928716593 │ 0.6061122685320124 │
│      4 │    1.32654500168768 │ 0.6880050396303666 │
│      5 │  1.6308654663142164 │ 0.7849700881062599 │
│      6 │  1.9978597283474906 │ 0.8918140923428113 │
│      7 │    2.49924847821658 │  1.044309233114935 │
│      8 │   3.258975663125616 │    1.2555553486423 │
│      9 │   4.760698918160984 │   1.62306336819192 │
│     10 │  11.029094432514512 │  3.272133126160228 │
├────────┴─────────────────────┴────────────────────┤
│ 10 rows                                 3 columns │
└───────────────────────────────────────────────────┘

Close connection

dbDisconnect(con)

Again, this step isn’t strictly necessary since we instantiated our connection with the shutdown = TRUE argument. But it’s worth seeing in case you want to be explicit.

con.close()
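On the Python side, one alternative (a sketch) is to open the connection as a context manager, in which case it closes automatically at the end of the with block:

## Hedged alternative: duckdb connections support the context-manager protocol.
# with duckdb.connect(database = ':memory:') as con:
#     con.query('SELECT 42 AS answer')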

Footnotes

  1. Aside: The shutdown = TRUE argument is a convenience feature that ensures our connection is automatically terminated when our R session ends (i.e., even if we forget to do it manually). I’m not aware of a similar convenience argument for Python; please let me know if I am missing something.↩︎

  2. I’ll admit that the benefits don’t seem so great for this simple example. But trust me: they make a big difference once you start having lots of grouping columns and complex sub-queries.↩︎