DuckDB + Ibis (Python)

Use a dplyr-esque Python frontend

Load libraries

import ibis
import ibis.selectors as s
from ibis import _
# ibis.options.interactive = True # enforce eager execution of queries

Connect and register

## Instantiate an in-memory DuckDB connection from Ibis
con = ibis.duckdb.connect()

## Register our parquet dataset as a table called "nyc" in our connection
nyc = con.read_parquet("nyc-taxi/**/*.parquet")

Aside: Remember that you can create a persistent, disk-backed database by giving it an appropriate name/path. This also enables out-of-core computation for bigger-than-RAM data.

# con = ibis.duckdb.connect("nyc.dbb")
# nyc = con.read_parquet("nyc-taxi/**/*.parquet")
# etc.

Reference the table from Python. We’ll call this reference object nyc too for consistency, but you could call it whatever you want (e.g., you could call it nyc_ibis to avoid potential ambiguity with the “nyc” table in our actual DuckDB connection). Printing the object to screen will give us a lazy preview.

nyc
DatabaseTable: ibis_read_parquet_ubbotra4snbqjhfys5d3pmq44e
  vendor_name           string
  pickup_datetime       timestamp(6)
  dropoff_datetime      timestamp(6)
  passenger_count       int64
  trip_distance         float64
  pickup_longitude      float64
  pickup_latitude       float64
  rate_code             string
  store_and_fwd         string
  dropoff_longitude     float64
  dropoff_latitude      float64
  payment_type          string
  fare_amount           float64
  extra                 float64
  mta_tax               float64
  tip_amount            float64
  tolls_amount          float64
  total_amount          float64
  improvement_surcharge float64
  congestion_surcharge  float64
  pickup_location_id    int64
  dropoff_location_id   int64
  month                 int64
  year                  int64

First example

q1 = (
  nyc
  .group_by(["passenger_count"])
  .agg(mean_tip = _.tip_amount.mean())
)

To see the underlying SQL translation, use ibis.to_sql().

ibis.to_sql(q1)
SELECT
  "t0"."passenger_count",
  AVG("t0"."tip_amount") AS "mean_tip"
FROM "ibis_read_parquet_ubbotra4snbqjhfys5d3pmq44e" AS "t0"
GROUP BY
  1

To actually execute the query and bring the result into Python, we can use the execute() method. By default this will coerce to a pandas DataFrame.

dat1 = q1.execute()
dat1
    passenger_count  mean_tip
0               249  0.000000
1                 0  0.862099
2               254  0.000000
3                 8  0.350769
4                66  1.500000
5                 2  1.081580
6               208  0.000000
7                10  0.000000
8                 1  1.151011
9               247  2.300000
10                3  0.962949
11                6  1.128365
12                5  1.102732
13               65  0.000000
14                9  0.806800
15              177  1.000000
16                7  0.544118
17                4  0.844519
Note: Ibis conversion to polars

The q1.execute() method above is equivalent to calling q1.to_pandas(). A q1.to_polars() equivalent has been added to the dev version of Ibis, but is not available with the latest official release (8.0.0 at the time of writing).
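In the meantime, one possible workaround (a minimal sketch, assuming that the pyarrow and polars packages are installed alongside Ibis) is to convert the result to a PyArrow table and hand that to polars:

import polars as pl

## Sketch: convert the query result to polars without to_polars()
## (assumes pyarrow and polars are installed)
dat1_pl = pl.from_arrow(q1.to_pyarrow())
dat1_pl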

Digression: Interactive Ibis use and eager execution

At the very top of this document, I commented out the ibis.options.interactive option as part of my Ibis configuration. This was because I wanted to demonstrate the default deferred (i.e., lazy) behaviour of Ibis, which is just the same as d(b)plyr in R. If you are building data wrangling pipelines, or writing scripts with potentially complex queries, you probably want to preserve this deferred behaviour and avoid eager execution.

However, there are times when you may want to default to eager execution. For example, if your dataset is of manageable size, or if you are iterating through different query operations… Or, you might just want to enable it so that you automatically get a nice printed return object for your workshop materials. I’ll adopt the latter view, so that I can quickly demonstrate some Ibis syntax and results for the rest of this document.

ibis.options.interactive = True

Okay, let’s speed through some of the same basic queries that we’ve already seen in the DuckDB SQL and R (dplyr) pages. I won’t bother to explain them in depth. Just consider them for demonstration purposes.

Aggregation

(
  nyc
  .group_by(["passenger_count", "trip_distance"])
  .aggregate(
    mean_tip = _.tip_amount.mean(),
    mean_fare = _.fare_amount.mean()
    )
)
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ passenger_count ┃ trip_distance ┃ mean_tip ┃ mean_fare ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64           │ float64       │ float64  │ float64   │
├─────────────────┼───────────────┼──────────┼───────────┤
│               1 │           5.3 │ 1.812836 │ 17.073446 │
│               2 │           3.2 │ 1.144488 │ 12.271308 │
│               0 │           3.2 │ 1.143411 │ 11.369103 │
│               1 │           8.9 │ 3.222421 │ 24.642239 │
│               1 │          17.4 │ 5.635788 │ 47.341387 │
│               4 │           1.2 │ 0.397633 │  6.819041 │
│               1 │           2.4 │ 1.069486 │ 10.005681 │
│               3 │          16.9 │ 3.056976 │ 47.266028 │
│               3 │          17.8 │ 2.755452 │ 48.073206 │
│               1 │           3.1 │ 1.278962 │ 11.872071 │
│               … │             … │        … │         … │
└─────────────────┴───────────────┴──────────┴───────────┘

Note that, even though we have enabled the interactive print mode, we still get lazy evaluation if we assign a chain of query steps to an object (here: q3)…

q3 = (
  nyc
  .group_by(["passenger_count", "trip_distance"])
  .agg(
    mean_tip = _.tip_amount.mean(),
    mean_fare = _.fare_amount.mean()
    )
)

… but printing the query to screen forces computation.

q3
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓
┃ passenger_count ┃ trip_distance ┃ mean_tip ┃ mean_fare ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64           │ float64       │ float64  │ float64   │
├─────────────────┼───────────────┼──────────┼───────────┤
│               1 │         11.10 │ 3.839358 │ 29.974973 │
│               2 │          2.60 │ 0.973256 │ 10.619201 │
│               1 │          3.50 │ 1.384175 │ 12.883433 │
│               1 │          2.30 │ 1.040463 │  9.740110 │
│               3 │          1.86 │ 0.777333 │  8.460294 │
│               1 │          0.90 │ 0.578975 │  5.687596 │
│               2 │          6.80 │ 1.977422 │ 20.725585 │
│               2 │          1.70 │ 0.727560 │  8.173302 │
│               1 │          0.81 │ 0.513297 │  5.402281 │
│               1 │          0.77 │ 0.499988 │  5.271770 │
│               … │             … │        … │         … │
└─────────────────┴───────────────┴──────────┴───────────┘

Pivot (reshape)

# now chain on pivoting (and force computation by printing)
(
  q3
  .pivot_longer(["mean_tip", "mean_fare"])
)
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ passenger_count ┃ trip_distance ┃ name      ┃ value     ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64           │ float64       │ string    │ float64   │
├─────────────────┼───────────────┼───────────┼───────────┤
│               1 │         11.10 │ mean_tip  │  3.839358 │
│               1 │         11.10 │ mean_fare │ 29.974973 │
│               2 │          2.60 │ mean_tip  │  0.973256 │
│               2 │          2.60 │ mean_fare │ 10.619201 │
│               1 │          3.50 │ mean_tip  │  1.384175 │
│               1 │          3.50 │ mean_fare │ 12.883433 │
│               1 │          2.30 │ mean_tip  │  1.040463 │
│               1 │          2.30 │ mean_fare │  9.740110 │
│               3 │          1.86 │ mean_tip  │  0.777333 │
│               3 │          1.86 │ mean_fare │  8.460294 │
│               … │             … │ …         │         … │
└─────────────────┴───────────────┴───────────┴───────────┘
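
Ibis also has a pivot_wider method (mirroring tidyr) for going back the other way. A minimal sketch, simply reversing the pivot_longer call above:

# Sketch: undo the previous reshape by pivoting the "name"/"value" pair back out
# into one column per summary statistic
(
  q3
  .pivot_longer(["mean_tip", "mean_fare"])
  .pivot_wider(names_from="name", values_from="value")
)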

Joins (merges)

(As we did in the dplyr code, we’ll break this contrived join example into two steps)

mean_tips = nyc.group_by("month").agg(mean_tip = _.tip_amount.mean())
mean_fares = nyc.group_by("month").agg(mean_fare = _.fare_amount.mean())
(
  mean_tips
  .left_join(mean_fares, "month")
)
┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ month ┃ mean_tip ┃ month_right ┃ mean_fare ┃
┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ int64 │ float64  │ int64       │ float64   │
├───────┼──────────┼─────────────┼───────────┤
│     8 │ 1.079521 │           8 │ 10.492650 │
│     6 │ 1.091082 │           6 │ 10.548651 │
│     9 │ 1.254601 │           9 │ 12.391198 │
│     5 │ 1.078014 │           5 │ 10.585157 │
│     7 │ 1.059312 │           7 │ 10.379943 │
│     3 │ 1.056353 │           3 │ 10.223107 │
│     2 │ 1.036874 │           2 │  9.942640 │
│     4 │ 1.043167 │           4 │ 10.335490 │
│    12 │ 1.237651 │          12 │ 12.313953 │
│    11 │ 1.250903 │          11 │ 12.270138 │
│     … │        … │           … │         … │
└───────┴──────────┴─────────────┴───────────┘
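
Note that the duplicated join key from the right table comes through as month_right. A minimal sketch of one way to tidy this up, by chaining a drop call onto the join:

# Sketch: drop the duplicated join key carried over from the right table
(
  mean_tips
  .left_join(mean_fares, "month")
  .drop("month_right")
)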

Disconnect

con.disconnect()