DuckDB + Ibis (Python)

Use a dplyr-esque Python frontend

Load libraries

import ibis
import ibis.selectors as s
from ibis import _
# ibis.options.interactive = True # enforce eager execution of queries

Connect and register

## Instantiate an in-memory DuckDB connection from Ibis
con = ibis.duckdb.connect()

## Register our parquet dataset as a table called "nyc" in our connection
nyc = con.read_parquet("nyc-taxi/**/*.parquet")

Aside: Remember that you can create a persistent, disk-backed database by giving it an appropriate name/path. This also enables out-of-core computation for bigger-than-RAM data.

# con = ibis.duckdb.connect("nyc.dbb")
# nyc = con.read_parquet("nyc-taxi/**/*.parquet")
# etc.

Reference the table from Python. We’ll call this reference object nyc too for consistency, but you could call it whatever you want (e.g., you could call it nyc_ibis to avoid potential ambiguity with the “nyc” table in our actual DuckDB connection). Printing the object to screen will give us a lazy preview.

nyc
DatabaseTable: ibis_read_parquet_ubbotra4snbqjhfys5d3pmq44e
  vendor_name           string
  pickup_datetime       timestamp(6)
  dropoff_datetime      timestamp(6)
  passenger_count       int64
  trip_distance         float64
  pickup_longitude      float64
  pickup_latitude       float64
  rate_code             string
  store_and_fwd         string
  dropoff_longitude     float64
  dropoff_latitude      float64
  payment_type          string
  fare_amount           float64
  extra                 float64
  mta_tax               float64
  tip_amount            float64
  tolls_amount          float64
  total_amount          float64
  improvement_surcharge float64
  congestion_surcharge  float64
  pickup_location_id    int64
  dropoff_location_id   int64
  month                 int64
  year                  int64

First example

q1 = (
  nyc
  .group_by(["passenger_count"])
  .agg(mean_tip = _.tip_amount.mean())
)

To see the underlying SQL translation, use ibis.to_sql().

ibis.to_sql(q1)
SELECT
  "t0"."passenger_count",
  AVG("t0"."tip_amount") AS "mean_tip"
FROM "ibis_read_parquet_ubbotra4snbqjhfys5d3pmq44e" AS "t0"
GROUP BY
  1

To actually execute the query and bring the result into Python, we can use the execute() method. By default this will coerce to a pandas DataFrame.

dat1 = q1.execute()
dat1
    passenger_count  mean_tip
0               249  0.000000
1                 0  0.862099
2               254  0.000000
3                 8  0.350769
4                66  1.500000
5                 2  1.081580
6               208  0.000000
7                10  0.000000
8                 1  1.151011
9               247  2.300000
10                3  0.962949
11                6  1.128365
12                5  1.102732
13               65  0.000000
14                9  0.806800
15              177  1.000000
16                7  0.544118
17                4  0.844519
Note: Ibis conversion to polars

The q1.execute() method above is equivalent to calling q1.to_pandas(). A q1.to_polars() equivalent has been added to the dev version of Ibis, but is not available with the latest official release (8.0.0 at the time of writing).
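In the meantime, one possible workaround (a minimal sketch, assuming that the pyarrow and polars packages are installed alongside Ibis) is to convert the result to a PyArrow table and hand that to polars:

import polars as pl

## Sketch: convert the query result to polars without to_polars()
## (assumes pyarrow and polars are installed)
dat1_pl = pl.from_arrow(q1.to_pyarrow())
dat1_pl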

Digression: Interactive Ibis use and eager execution

At the very top of this document, I commented out the ibis.options.interactive option as part of my Ibis configuration. This was because I wanted to demonstrate the default deferred (i.e., lazy) behaviour of Ibis, which is just the same as d(b)plyr in R. If you are building data wrangling pipelines, or writing scripts with potentially complex queries, you probably want to preserve this deferred behaviour and avoid eager execution.

However, there are times when you may want to default to eager execution. For example, if your dataset is of manageable size, or if you are iterating through different query operations… Or, you might just want to enable it so that you automatically get a nice printed return object for your workshop materials. I’ll adopt the latter view, so that I can quickly demonstrate some Ibis syntax and results for the rest of this document.

ibis.options.interactive = True

Okay, let’s speed through some of the same basic queries that we’ve already seen in the DuckDB SQL and R (dplyr) pages. I won’t bother to explain them in depth. Just consider them for demonstration purposes.

Aggregation

(
  nyc
  .group_by(["passenger_count", "trip_distance"])
  .aggregate(
    mean_tip = _.tip_amount.mean(),
    mean_fare = _.fare_amount.mean()
    )
)
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ passenger_count ┃ trip_distance ┃ mean_tip ┃ mean_fare ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64           │ float64       │ float64  │ float64   │
├─────────────────┼───────────────┼──────────┼───────────┤
│               1 │           5.3 │ 1.812836 │ 17.073446 │
│               2 │           3.2 │ 1.144488 │ 12.271308 │
│               0 │           3.2 │ 1.143411 │ 11.369103 │
│               1 │           8.9 │ 3.222421 │ 24.642239 │
│               1 │          17.4 │ 5.635788 │ 47.341387 │
│               4 │           1.2 │ 0.397633 │  6.819041 │
│               1 │           2.4 │ 1.069486 │ 10.005681 │
│               3 │          16.9 │ 3.056976 │ 47.266028 │
│               3 │          17.8 │ 2.755452 │ 48.073206 │
│               1 │           3.1 │ 1.278962 │ 11.872071 │
│               … │             … │        … │         … │
└─────────────────┴───────────────┴──────────┴───────────┘

Note that, even though we have enabled the interactive print mode, we still get lazy evaluation if we assign a chain of query steps to an object (here: q3)…

q3 = (
  nyc
  .group_by(["passenger_count", "trip_distance"])
  .agg(
    mean_tip = _.tip_amount.mean(),
    mean_fare = _.fare_amount.mean()
    )
)

… but printing the query to screen forces computation.

q3
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ passenger_count ┃ trip_distance ┃ mean_tip ┃ mean_fare ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64           │ float64       │ float64  │ float64   │
├─────────────────┼───────────────┼──────────┼───────────┤
│               1 │         11.10 │ 3.839358 │ 29.974973 │
│               2 │          2.60 │ 0.973256 │ 10.619201 │
│               1 │          3.50 │ 1.384175 │ 12.883433 │
│               1 │          2.30 │ 1.040463 │  9.740110 │
│               3 │          1.86 │ 0.777333 │  8.460294 │
│               1 │          0.90 │ 0.578975 │  5.687596 │
│               2 │          6.80 │ 1.977422 │ 20.725585 │
│               2 │          1.70 │ 0.727560 │  8.173302 │
│               1 │          0.81 │ 0.513297 │  5.402281 │
│               1 │          0.77 │ 0.499988 │  5.271770 │
│               … │             … │        … │         … │
└─────────────────┴───────────────┴──────────┴───────────┘

Pivot (reshape)

# now chain on pivoting (and force computation by printing)
(
  q3
  .pivot_longer(["mean_tip", "mean_fare"])
)
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ passenger_count ┃ trip_distance ┃ name      ┃ value     ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64           │ float64       │ string    │ float64   │
├─────────────────┼───────────────┼───────────┼───────────┤
│               1 │         11.10 │ mean_tip  │  3.839358 │
│               1 │         11.10 │ mean_fare │ 29.974973 │
│               2 │          2.60 │ mean_tip  │  0.973256 │
│               2 │          2.60 │ mean_fare │ 10.619201 │
│               1 │          3.50 │ mean_tip  │  1.384175 │
│               1 │          3.50 │ mean_fare │ 12.883433 │
│               1 │          2.30 │ mean_tip  │  1.040463 │
│               1 │          2.30 │ mean_fare │  9.740110 │
│               3 │          1.86 │ mean_tip  │  0.777333 │
│               3 │          1.86 │ mean_fare │  8.460294 │
│               … │             … │ …         │         … │
└─────────────────┴───────────────┴───────────┴───────────┘
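
Ibis also has a pivot_wider method (mirroring tidyr) for going back the other way. A minimal sketch, simply reversing the pivot_longer call above:

# Sketch: undo the previous reshape by pivoting the "name"/"value" pair back out
# into one column per summary statistic
(
  q3
  .pivot_longer(["mean_tip", "mean_fare"])
  .pivot_wider(names_from="name", values_from="value")
)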

Joins (merges)

(As we did in the dplyr code, we’ll break this contrived join example into two steps)

mean_tips = nyc.group_by("month").agg(mean_tip = _.tip_amount.mean())
mean_fares = nyc.group_by("month").agg(mean_fare = _.fare_amount.mean())
(
  mean_tips
  .left_join(mean_fares, "month")
)
┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ month ┃ mean_tip ┃ month_right ┃ mean_fare ┃
┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64 │ float64  │ int64       │ float64   │
├───────┼──────────┼─────────────┼───────────┤
│     8 │ 1.079521 │           8 │ 10.492650 │
│     6 │ 1.091082 │           6 │ 10.548651 │
│     9 │ 1.254601 │           9 │ 12.391198 │
│     5 │ 1.078014 │           5 │ 10.585157 │
│     7 │ 1.059312 │           7 │ 10.379943 │
│     3 │ 1.056353 │           3 │ 10.223107 │
│     2 │ 1.036874 │           2 │  9.942640 │
│     4 │ 1.043167 │           4 │ 10.335490 │
│    12 │ 1.237651 │          12 │ 12.313953 │
│    11 │ 1.250903 │          11 │ 12.270138 │
│     … │        … │           … │         … │
└───────┴──────────┴─────────────┴───────────┘
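
Note that the duplicated join key from the right table comes through as month_right. A minimal sketch of one way to tidy this up, by chaining a drop call onto the join:

# Sketch: drop the duplicated join key carried over from the right table
(
  mean_tips
  .left_join(mean_fares, "month")
  .drop("month_right")
)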

Disconnect

con.disconnect()