library(duckdb)
Loading required package: DBI
Use the same queries from R, Python, Julia…
The first thing we need to do is instantiate a connection with an in-memory database.1
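For example, from the R side (a minimal sketch; the resulting con object is what all of the queries below are run against, and the shutdown = TRUE argument is the convenience feature described in the footnote):

# Instantiate an ephemeral, in-memory DuckDB database and connect to it
con = dbConnect(duckdb(), shutdown = TRUE)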
The fact that our connection lives “in memory” is a killer feature of DuckDB (one that it inherits from SQLite). We don’t need to connect to some complicated, existing database infrastructure to harness all of DuckDB’s power. Instead we can just spin up an ephemeral database that interacts directly with our R (or Python, or Julia, etc.) client.
However, it’s worth noting that you can create a persistent, disk-backed database simply by providing a database file path argument as part of your connection, e.g.
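Something along these lines (a sketch; the "nyc.duck" file name is just a placeholder):

# Connect to a persistent database file on disk (created if it doesn't already exist)
con2 = dbConnect(duckdb(), dbdir = "nyc.duck")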
(Note that the ".duck" file extension above is optional. You could also use ".db", ".dbb", or whatever you want really.)
One really important benefit of creating a persistent disk-backed database is that it enables out-of-core computation for bigger than RAM data. See here for more details and performance considerations (which are still great).
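As a rough sketch of how you might lean into this (the memory_limit value is arbitrary, and con2 is the hypothetical disk-backed connection from above), you can also cap DuckDB's RAM usage and let it spill intermediate results to disk:

# Optional: cap memory usage; DuckDB spills to disk for larger-than-RAM operations
dbExecute(con2, "SET memory_limit = '4GB'")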
We’ll start with a simple aggregation query to get situated. I’ll also use this example to highlight some general features of DuckDB SQL and the underlying query engine.
Okay, first query. Let’s say we want to know: What is the average tip per passenger count? A typical SQL job for this question might look as follows:
SELECT
passenger_count,
AVG(tip_amount) AS mean_tip
FROM 'nyc-taxi/**/*.parquet'
GROUP BY passenger_count
ORDER BY passenger_count
(Where the last ORDER BY statement is optional. Note that ordering (i.e., sorting) is a potentially expensive operation, but we'll get back to that later.)
This is perfectly valid DuckDB SQL too. However, we can rewrite it with slightly nicer syntax thanks to DuckDB's "friendly SQL". The key changes for this simple query are going to be: (1) the FROM statement comes first, and (2) we can use the GROUP BY ALL and ORDER BY ALL statements to avoid writing out the "passenger_count" grouping column multiple times.2
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
One of the under-appreciated (IMHO) features of DuckDB is that it supports many syntax enhancements over traditional SQL dialects, which they collectively dub "friendly SQL". Together these syntax enhancements allow you to write much more ergonomic SQL queries that cut down on duplication and logical inconsistencies.
To run this operation from our R or Python client, simply pass the SQL query as a string to our connection. Let’s use this as a chance to save the result and time our query too.
tic = Sys.time()
dat1 = dbGetQuery(
con,
"
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
"
)
toc = Sys.time()
dat1
passenger_count mean_tip
1 0 0.8620988
2 1 1.1510110
3 2 1.0815798
4 3 0.9629494
5 4 0.8445190
6 5 1.1027325
7 6 1.1283649
8 7 0.5441176
9 8 0.3507692
10 9 0.8068000
11 10 0.0000000
12 65 0.0000000
13 66 1.5000000
14 177 1.0000000
15 208 0.0000000
16 247 2.3000000
17 249 0.0000000
18 254 0.0000000
Time difference of 1.211972 secs
tic = time.time()
dat1 = (
con.
query(
'''
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
'''
)
)
toc = time.time()
dat1
┌─────────────────┬─────────────────────┐
│ passenger_count │ mean_tip │
│ int64 │ double │
├─────────────────┼─────────────────────┤
│ 0 │ 0.8620988141424404 │
│ 1 │ 1.1510109615454076 │
│ 2 │ 1.0815798424001326 │
│ 3 │ 0.9629493657892962 │
│ 4 │ 0.8445189789660359 │
│ 5 │ 1.102732453261797 │
│ 6 │ 1.1283649236954338 │
│ 7 │ 0.5441176470588235 │
│ 8 │ 0.35076923076923083 │
│ 9 │ 0.8068000000000001 │
│ 10 │ 0.0 │
│ 65 │ 0.0 │
│ 66 │ 1.5 │
│ 177 │ 1.0 │
│ 208 │ 0.0 │
│ 247 │ 2.3 │
│ 249 │ 0.0 │
│ 254 │ 0.0 │
├─────────────────┴─────────────────────┤
│ 18 rows 2 columns │
└───────────────────────────────────────┘
Note that we actually get a polars DataFrame as a return object. Click the callout box below to learn more.
By default, the con.query method that we are using here will return a polars DataFrame object that Python understands "natively" (i.e., has a print method for and so on). Behind the scenes, this duckdb to polars integration relies on the pyarrow library being available to our Python environment, which we have already installed for this workshop.
It's also possible to return other types of Python objects. For example, you can use the .df() method to coerce to a pandas DataFrame instead, among several other formats like numpy arrays. (Details here.) Given the focus of this workshop, it won't surprise you to hear that I'm going to stick with the default polars conversion.
So that only took 1.21 seconds in this rendered Quarto doc (and will likely be even faster when you try in an interactive session). To underscore just how crazy impressive this is, recall that this includes the time that it takes to read the data from disk. I can almost guarantee that the read + serialization time alone for traditional data wrangling workflows would take several minutes, and probably crash my laptop RAM. Don’t forget that our full dataset is nearly 200 million rows deep and 30 columns wide.
Aside: We clearly have a few outlier typos in our dataset. 254 passengers in a single taxi trip? I don't think so. We'd probably want to filter these out with a WHERE statement if we were doing serious analysis, but I'm just going to leave them in for this tutorial.
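(If you did want to drop them, here is a hedged sketch of what that might look like; the cutoff of 9 passengers is arbitrary.)

dat1_clean = dbGetQuery(
  con,
  "
  FROM 'nyc-taxi/**/*.parquet'
  SELECT
    passenger_count,
    AVG(tip_amount) AS mean_tip
  WHERE passenger_count <= 9  -- drop the obvious data-entry errors
  GROUP BY ALL
  ORDER BY ALL
  "
)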
Let's try out some more aggregation queries. How about a slight variation on our first example query, where we (a) add "month" as a second grouping variable, and (b) subset to only the first three months of the year.
tic = Sys.time()
dat2 = dbGetQuery(
con,
"
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
passenger_count,
AVG(tip_amount) AS mean_tip
WHERE month <= 3
GROUP BY ALL
"
)
toc = Sys.time()
head(dat2)
month passenger_count mean_tip
1 1 1 1.036863
2 2 1 1.068490
3 3 1 1.089205
4 2 8 0.500000
5 1 8 0.000000
6 1 7 0.000000
Time difference of 0.373965 secs
tic = time.time()
dat2 = (
con.
query(
'''
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
passenger_count,
AVG(tip_amount) AS mean_tip
WHERE month <= 3
GROUP BY ALL
'''
)
)
toc = time.time()
dat2
┌───────┬─────────────────┬────────────────────┐
│ month │ passenger_count │ mean_tip │
│ int64 │ int64 │ double │
├───────┼─────────────────┼────────────────────┤
│ 1 │ 8 │ 0.0 │
│ 2 │ 8 │ 0.5 │
│ 1 │ 7 │ 0.0 │
│ 1 │ 3 │ 0.8752659692390742 │
│ 1 │ 6 │ 1.0175694433120148 │
│ 2 │ 3 │ 0.8948976704752726 │
│ 2 │ 6 │ 1.0218459360559857 │
│ 3 │ 6 │ 1.0515659390082825 │
│ 3 │ 3 │ 0.9121818858010082 │
│ 1 │ 1 │ 1.0368628142310818 │
│ · │ · │ · │
│ · │ · │ · │
│ · │ · │ · │
│ 2 │ 2 │ 0.9908003546925661 │
│ 3 │ 2 │ 1.0096468528132252 │
│ 3 │ 208 │ 0.0 │
│ 1 │ 208 │ 0.0 │
│ 1 │ 5 │ 1.001198485873298 │
│ 2 │ 5 │ 1.0157674270380461 │
│ 3 │ 5 │ 1.0353911898062576 │
│ 1 │ 65 │ 0.0 │
│ 2 │ 9 │ 0.0 │
│ 1 │ 9 │ 0.0 │
├───────┴─────────────────┴────────────────────┤
│ 29 rows (20 shown) 3 columns │
└──────────────────────────────────────────────┘
Note that this query completed even faster than the first one, even though we added another grouping variable. Reason: Subsetting along our Hive-partitioned parquet dataset allows DuckDB to take shortcuts. We can see this directly by prepending an EXPLAIN statement to our query to reveal the optimized query plan.
dbGetQuery(
con,
"
EXPLAIN
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
passenger_count,
AVG(tip_amount) AS mean_tip
WHERE month <= 3
GROUP BY ALL
"
)
physical_plan
┌───────────────────────────┐
│ HASH_GROUP_BY │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ #0 │
│ #1 │
│ avg(#2) │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PROJECTION │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ month │
│ passenger_count │
│ tip_amount │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│ PARQUET_SCAN │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ month │
│ passenger_count │
│ tip_amount │
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ File Filters: (month <= 3)│
│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │
│ EC: 44907396 │
└───────────────────────────┘
con.query(
'''
EXPLAIN
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
passenger_count,
AVG(tip_amount) AS mean_tip
WHERE month <= 3
GROUP BY ALL
'''
)
┌───────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ explain_key │ explain_value │
│ varchar │ varchar │
├───────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ physical_plan │ ┌───────────────────────────┐\n│ HASH_GROUP_BY │\n│ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ │\n│ … │
└───────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────┘
tl;dr DuckDB is able to exploit the month partition of our dataset, so subsetting means that it can avoid unnecessary data ingestion. Similarly, it only reads in a select group of columns; that's what the "PROJECTION" part of the plan denotes. If nothing else, the take-home message is that DuckDB only does what it needs to. Laziness as a virtue!
Here’s a final aggregation example, this time including a high-dimensional grouping column (i.e., “trip_distance”) and some additional aggregations.
tic = Sys.time()
dat3 = dbGetQuery(
con,
"
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
trip_distance,
AVG(tip_amount) AS mean_tip,
AVG(fare_amount) AS mean_fare
GROUP BY ALL
"
)
toc = Sys.time()
nrow(dat3)
[1] 25569
passenger_count trip_distance mean_tip mean_fare
1 1 3.80 1.4617821 13.624269
2 1 2.70 1.1625732 10.811488
3 1 2.20 1.0099496 9.471702
4 1 4.70 1.6794478 15.728522
5 2 1.86 0.8184262 8.482737
6 2 2.70 0.9988106 10.910493
Time difference of 2.870613 secs
tic = time.time()
dat3 = (
con.
query(
'''
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
trip_distance,
AVG(tip_amount) AS mean_tip,
AVG(fare_amount) AS mean_fare
GROUP BY ALL
'''
)
)
toc = time.time()
len(dat3)
25569
┌─────────────────┬───────────────┬─────────────────────┬────────────────────┐
│ passenger_count │ trip_distance │ mean_tip │ mean_fare │
│ int64 │ double │ double │ double │
├─────────────────┼───────────────┼─────────────────────┼────────────────────┤
│ 1 │ 2.7 │ 1.16257323977722 │ 10.811487937005044 │
│ 1 │ 8.5 │ 3.0037572144209994 │ 23.762548777891176 │
│ 2 │ 2.7 │ 0.9988105937071349 │ 10.910492532335036 │
│ 1 │ 1.8 │ 0.878870484195116 │ 8.402046190257858 │
│ 1 │ 0.2 │ 0.3864478915075284 │ 4.671192577145984 │
│ 1 │ 5.9 │ 1.9316437836686586 │ 18.4181575160648 │
│ 1 │ 9.8 │ 3.602063841563186 │ 26.75012573394927 │
│ 1 │ 4.7 │ 1.679447806444995 │ 15.728522049844514 │
│ 1 │ 2.2 │ 1.009949551490608 │ 9.47170239542306 │
│ 1 │ 0.93 │ 0.5659718504077408 │ 5.8204754998691905 │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ 5 │ 3.2 │ 1.213120921023685 │ 12.059595218889795 │
│ 2 │ 4.58 │ 1.5289010022859153 │ 15.32215579391595 │
│ 3 │ 1.34 │ 0.6439547586916009 │ 6.937227684596117 │
│ 5 │ 0.69 │ 0.4423211070274535 │ 4.844665893786588 │
│ 1 │ 17.93 │ 4.768273278630039 │ 48.079521940777724 │
│ 6 │ 0.49 │ 0.36281714343475185 │ 4.2202365308804 │
│ 6 │ 1.53 │ 0.7403351611114536 │ 7.555714373729288 │
│ 1 │ 18.06 │ 4.738210227272728 │ 48.059339488636354 │
│ 5 │ 13.6 │ 3.7402173913043484 │ 36.93664596273292 │
│ 1 │ 12.75 │ 3.638386937159822 │ 34.57689262741217 │
├─────────────────┴───────────────┴─────────────────────┴────────────────────┤
│ ? rows (>9999 rows, 20 shown) 4 columns │
└────────────────────────────────────────────────────────────────────────────┘
Let’s explore some pivot (reshape) examples, by building off the previous query.
UNPIVOT: wide => long
PIVOT: long => wide

Here I'll use a Common Table Expression (CTE) to define a temporary table tmp_table, before unpivoting—i.e., reshaping long—at the end.
dat_long = dbGetQuery(
con,
"
WITH tmp_table AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
trip_distance,
AVG(tip_amount) AS mean_tip,
AVG(fare_amount) AS mean_fare
GROUP BY ALL
)
UNPIVOT tmp_table
ON mean_tip, mean_fare
INTO
NAME variable
VALUE amount
"
)
head(dat_long)
passenger_count trip_distance variable amount
1 1 1.1 mean_tip 0.6530588
2 1 1.1 mean_fare 6.3333552
3 2 18.0 mean_tip 3.7746336
4 2 18.0 mean_fare 48.1438661
5 1 1.2 mean_tip 0.6882080
6 1 1.2 mean_fare 6.6407479
dat_long = (
con.
query(
'''
WITH tmp_table AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
trip_distance,
AVG(tip_amount) AS mean_tip,
AVG(fare_amount) AS mean_fare
GROUP BY ALL
)
UNPIVOT tmp_table
ON mean_tip, mean_fare
INTO
NAME variable
VALUE amount
'''
)
)
dat_long
┌─────────────────┬───────────────┬───────────┬────────────────────┐
│ passenger_count │ trip_distance │ variable │ amount │
│ int64 │ double │ varchar │ double │
├─────────────────┼───────────────┼───────────┼────────────────────┤
│ 1 │ 0.9 │ mean_tip │ 0.578974821777743 │
│ 1 │ 0.9 │ mean_fare │ 5.687595795787245 │
│ 1 │ 11.1 │ mean_tip │ 3.839358215996518 │
│ 1 │ 11.1 │ mean_fare │ 29.974972782697066 │
│ 1 │ 4.0 │ mean_tip │ 1.5167455372913883 │
│ 1 │ 4.0 │ mean_fare │ 14.098943870659395 │
│ 1 │ 17.3 │ mean_tip │ 5.590380869171258 │
│ 1 │ 17.3 │ mean_fare │ 47.364542924635145 │
│ 1 │ 0.71 │ mean_tip │ 0.4766168470138103 │
│ 1 │ 0.71 │ mean_fare │ 5.104461268390179 │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ · │ · │ · │ · │
│ 5 │ 12.7 │ mean_tip │ 3.07948347107438 │
│ 5 │ 12.7 │ mean_fare │ 33.64896694214877 │
│ 2 │ 12.25 │ mean_tip │ 3.1529553264604813 │
│ 2 │ 12.25 │ mean_fare │ 32.42886597938145 │
│ 6 │ 9.22 │ mean_tip │ 3.173150684931507 │
│ 6 │ 9.22 │ mean_fare │ 25.668340943683408 │
│ 6 │ 9.34 │ mean_tip │ 2.8852453987730065 │
│ 6 │ 9.34 │ mean_fare │ 25.80444785276073 │
│ 1 │ 33.5 │ mean_tip │ 5.801387283236995 │
│ 1 │ 33.5 │ mean_fare │ 56.67167630057804 │
├─────────────────┴───────────────┴───────────┴────────────────────┤
│ ? rows (>9999 rows, 20 shown) 4 columns │
└──────────────────────────────────────────────────────────────────┘
Another option would have been to create a new table in memory and then pivot over that, which segues nicely to…
CTEs are a very common, and often efficient, way to implement multi-table operations in SQL. But, for the record, we can create new tables in DuckDB's memory cache pretty easily using the CREATE TABLE statement.
Instead of DBI::dbGetQuery, we must now use DBI::dbExecute.
dbExecute(
con,
"
CREATE TABLE taxi2 AS
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
trip_distance,
AVG(tip_amount) AS mean_tip,
AVG(fare_amount) AS mean_fare
GROUP BY ALL
"
)
[1] 25569
[1] "taxi2"
FWIW, you can always remove a table with dbRemoveTable().
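For example (not run here, since we still use taxi2 below):

# dbRemoveTable(con, "taxi2")  # drops the table from the database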
Instead of con.query(), we must now use con.execute().
con.execute(
'''
CREATE TABLE taxi2 AS
FROM 'nyc-taxi/**/*.parquet'
SELECT
passenger_count,
trip_distance,
AVG(tip_amount) AS mean_tip,
AVG(fare_amount) AS mean_fare
GROUP BY ALL
'''
)
<duckdb.duckdb.DuckDBPyConnection object at 0x1062d3cb0>
# https://stackoverflow.com/q/75727685
con.query(
'''
SELECT table_name, estimated_size AS nrows, column_count AS ncols
FROM duckdb_tables;
'''
)
┌────────────┬───────┬───────┐
│ table_name │ nrows │ ncols │
│ varchar │ int64 │ int64 │
├────────────┼───────┼───────┤
│ taxi2 │ 25569 │ 4 │
└────────────┴───────┴───────┘
With our new taxi2 table in hand, let's redo the previous unpivot query directly on this new table. This makes the actual (un)pivot statement a bit clearer… and also separates out the execution time.
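The query itself now reduces to the UNPIVOT step. A sketch of the R version (the Python version just passes the same SQL string to con.query):

dbGetQuery(
  con,
  "
  UNPIVOT taxi2
  ON mean_tip, mean_fare
  INTO
    NAME variable
    VALUE amount
  "
)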
┌─────────────────┬───────────────┬───────────┬────────────────────┐
│ passenger_count │ trip_distance │ variable │ amount │
│ int64 │ double │ varchar │ double │
├─────────────────┼───────────────┼───────────┼────────────────────┤
│ 1 │ 2.7 │ mean_tip │ 1.16257323977722 │
│ 1 │ 2.7 │ mean_fare │ 10.81148793700505 │
│ 0 │ 9.1 │ mean_tip │ 2.399264497878359 │
│ 0 │ 9.1 │ mean_fare │ 23.355728429985852 │
│ 1 │ 1.8 │ mean_tip │ 0.8788704841951157 │
└─────────────────┴───────────────┴───────────┴────────────────────┘
(Note how crazy fast pivoting in DuckDB actually is.)
It’s a bit hard to demonstrate a join with only a single main table. But here is a contrived example, where we calculate the mean monthly tips and the mean monthly fares as separate sub-queries (CTEs), before joining them together by month.
dbGetQuery(
con,
"
WITH
mean_tips AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
AVG(tip_amount) AS mean_tip
GROUP BY month
),
mean_fares AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
AVG(fare_amount) AS mean_fare
GROUP BY month
)
FROM mean_tips
LEFT JOIN mean_fares
USING (month)
SELECT *
ORDER BY mean_tips.month
"
)
month mean_tip mean_fare
1 1 1.007817 9.813488
2 2 1.036874 9.942640
3 3 1.056353 10.223107
4 4 1.043167 10.335490
5 5 1.078014 10.585157
6 6 1.091082 10.548651
7 7 1.059312 10.379943
8 8 1.079521 10.492650
9 9 1.254601 12.391198
10 10 1.281239 12.501252
11 11 1.250903 12.270138
12 12 1.237651 12.313953
con.query(
'''
WITH
mean_tips AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
AVG(tip_amount) AS mean_tip
GROUP BY month
),
mean_fares AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
month,
AVG(fare_amount) AS mean_fare
GROUP BY month
)
FROM mean_tips
LEFT JOIN mean_fares
USING (month)
SELECT *
ORDER BY mean_tips.month
'''
)
┌───────┬────────────────────┬────────────────────┐
│ month │ mean_tip │ mean_fare │
│ int64 │ double │ double │
├───────┼────────────────────┼────────────────────┤
│ 1 │ 1.0078165246989366 │ 9.813487671828813 │
│ 2 │ 1.0368737381553987 │ 9.942640301300228 │
│ 3 │ 1.05635274287244 │ 10.223107216153554 │
│ 4 │ 1.0431674901411208 │ 10.335489610549338 │
│ 5 │ 1.0780143169836092 │ 10.585156844134143 │
│ 6 │ 1.091082009381275 │ 10.548651231531922 │
│ 7 │ 1.059312239456315 │ 10.379943069577804 │
│ 8 │ 1.0795208991227114 │ 10.492650001890153 │
│ 9 │ 1.2546008978994332 │ 12.391197540031698 │
│ 10 │ 1.2812392796882088 │ 12.501252484194163 │
│ 11 │ 1.2509031985269687 │ 12.270137514944446 │
│ 12 │ 1.2376507362291407 │ 12.313952857613234 │
├───────┴────────────────────┴────────────────────┤
│ 12 rows 3 columns │
└─────────────────────────────────────────────────┘
Redo the above join but, rather than using CTEs, use tables that you first create in DuckDB’s memory bank. Again, this will simplify the actual join operation and also emphasise how crazy fast joins are in DuckDB.
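One possible approach is sketched below (the table names mean_tips and mean_fares are just illustrative): create the two summary tables up front with CREATE TABLE, after which the join itself becomes a short, fast query.

# Create the two monthly summary tables up front...
dbExecute(
  con,
  "
  CREATE TABLE mean_tips AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT month, AVG(tip_amount) AS mean_tip
    GROUP BY month
  "
)
dbExecute(
  con,
  "
  CREATE TABLE mean_fares AS
    FROM 'nyc-taxi/**/*.parquet'
    SELECT month, AVG(fare_amount) AS mean_fare
    GROUP BY month
  "
)
# ... and now the join itself is trivial (and very fast)
dbGetQuery(
  con,
  "
  FROM mean_tips
  LEFT JOIN mean_fares USING (month)
  SELECT *
  ORDER BY mean_tips.month
  "
)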
One last example: Binning "trip_distance" into deciles and then calculating the mean "tip_amount" within each decile. This is an example of a window function and query pattern that I use all the time in my own work. I find it extremely useful for quickly pulling out descriptive patterns from large datasets, from which I can then develop a better intuition of my data. In turn, this shapes the hypotheses and modeling choices that I make in the subsequent analysis stage.
I'm using a 1% random sample of my data here (see the USING SAMPLE 1% statement). Why? Because calculating deciles requires ranking your data and this is expensive! To rank data, we first have to sort it (ORDER) and this requires evaluating/comparing every single row in your dataset. In turn, this means that it's very hard to take shortcuts. (This is one reason why DuckDB's optimized query plan will always delay sorting until as late as possible; to only sort on a smaller subset/aggregation of the data if possible.) FWIW, DuckDB's sorting algorithm is still crazy fast. But for data of this size, and where sorting on the full dataset is unavoidable, I strongly recommend sampling first. Your general insights will almost certainly remain intact.
dbGetQuery(
con,
"
WITH trip_deciles AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
tip_amount,
trip_distance,
NTILE(10) OVER (ORDER BY trip_distance) AS decile
USING SAMPLE 1%
)
FROM trip_deciles
SELECT
decile,
AVG(trip_distance) AS mean_distance,
AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
"
)
decile mean_distance mean_tip
1 1 0.4485001 0.5815993
2 2 0.8077842 0.5140569
3 3 1.0636530 0.5985021
4 4 1.3244716 0.6870273
5 5 1.6286108 0.7805147
6 6 1.9959871 0.8992282
7 7 2.4957121 1.0367072
8 8 3.2551231 1.2510039
9 9 4.7643520 1.6101116
10 10 11.0504331 3.2807728
con.query(
'''
WITH trip_deciles AS (
FROM 'nyc-taxi/**/*.parquet'
SELECT
tip_amount,
trip_distance,
NTILE(10) OVER (ORDER BY trip_distance) AS decile
USING SAMPLE 1%
)
FROM trip_deciles
SELECT
decile,
AVG(trip_distance) AS mean_distance,
AVG(tip_amount) AS mean_tip
GROUP BY ALL
ORDER BY ALL
'''
)
┌────────┬─────────────────────┬────────────────────┐
│ decile │ mean_distance │ mean_tip │
│ int64 │ double │ double │
├────────┼─────────────────────┼────────────────────┤
│ 1 │ 0.44922310548259337 │ 0.5799620572865318 │
│ 2 │ 0.8087209464724292 │ 0.5141380253494722 │
│ 3 │ 1.0653553928716593 │ 0.6061122685320124 │
│ 4 │ 1.32654500168768 │ 0.6880050396303666 │
│ 5 │ 1.6308654663142164 │ 0.7849700881062599 │
│ 6 │ 1.9978597283474906 │ 0.8918140923428113 │
│ 7 │ 2.49924847821658 │ 1.044309233114935 │
│ 8 │ 3.258975663125616 │ 1.2555553486423 │
│ 9 │ 4.760698918160984 │ 1.62306336819192 │
│ 10 │ 11.029094432514512 │ 3.272133126160228 │
├────────┴─────────────────────┴────────────────────┤
│ 10 rows 3 columns │
└───────────────────────────────────────────────────┘
Aside: The shutdown = TRUE argument is a convenience feature that ensures our connection is automatically terminated when our R session ends (i.e., even if we forget to do it manually). I'm not aware of a similar convenience argument for Python; please let me know if I am missing something.↩︎
I’ll admit that the benefits don’t seem so great for this simple example. But trust me: they make a big difference once you start having lots of grouping columns and complex sub-queries.↩︎