import ibis
import ibis.selectors as s
from ibis import _
# ibis.options.interactive = True # enforce eager execution of queries
DuckDB + Ibis (Python)
Use a dplyr-esque Python frontend
Load libraries
Connect and register
## Instantiate an in-memory DuckDB connection from Ibis
= ibis.duckdb.connect()
con
## Register our parquet dataset as a table called "nyc" in our connection
"nyc-taxi/**/*.parquet", "nyc") con.register(
DatabaseTable: nyc vendor_name string pickup_datetime timestamp dropoff_datetime timestamp passenger_count int64 trip_distance float64 pickup_longitude float64 pickup_latitude float64 rate_code string store_and_fwd string dropoff_longitude float64 dropoff_latitude float64 payment_type string fare_amount float64 extra float64 mta_tax float64 tip_amount float64 tolls_amount float64 total_amount float64 improvement_surcharge float64 congestion_surcharge float64 pickup_location_id int64 dropoff_location_id int64 month int64 year int64
Aside: Remember that you can create a persistent, disk-backed database by giving it an appropriate name/path. This also enables out-of-core computation for bigger than RAM data.
# con = ibis.duckdb.connect("nyc.dbb")
# con.register("nyc-taxi/**/*.parquet", "nyc")
# etc.
Reference the table from Python. We’ll call this reference object nyc
too for consistency, but you could call it whatever you want (e.g., you could call it nyc_ibis
to avoid potential ambiguity with the “nyc” table in our actual DuckDB connection). Printing the object to screen will give us a lazy preview.
# con.list_tables() # Optional: confirm that our table is available
= con.table("nyc")
nyc nyc
DatabaseTable: nyc vendor_name string pickup_datetime timestamp dropoff_datetime timestamp passenger_count int64 trip_distance float64 pickup_longitude float64 pickup_latitude float64 rate_code string store_and_fwd string dropoff_longitude float64 dropoff_latitude float64 payment_type string fare_amount float64 extra float64 mta_tax float64 tip_amount float64 tolls_amount float64 total_amount float64 improvement_surcharge float64 congestion_surcharge float64 pickup_location_id int64 dropoff_location_id int64 month int64 year int64
First example
= (
q1
nyc"passenger_count"])
.group_by([= _.tip_amount.mean())
.agg(mean_tip )
To see the underlying SQL translation, use ibis.to_sql()
ibis.to_sql(q1)
SELECT
t0.passenger_count,AVG(t0.tip_amount) AS mean_tip
FROM main.nyc AS t0
GROUP BY
1
To actually execute the query and bring the result into Python, we can use the execute()
method. By default this will coerce to a pandas DataFrame.
= q1.execute()
dat1 dat1
passenger_count | mean_tip | |
---|---|---|
0 | 0 | 0.862099 |
1 | 254 | 0.000000 |
2 | 249 | 0.000000 |
3 | 1 | 1.151011 |
4 | 247 | 2.300000 |
5 | 4 | 0.844519 |
6 | 3 | 0.962949 |
7 | 6 | 1.128365 |
8 | 5 | 1.102732 |
9 | 9 | 0.806800 |
10 | 177 | 1.000000 |
11 | 65 | 0.000000 |
12 | 2 | 1.081580 |
13 | 208 | 0.000000 |
14 | 10 | 0.000000 |
15 | 7 | 0.544118 |
16 | 8 | 0.350769 |
17 | 66 | 1.500000 |
The q1.execute()
method above is equivalent calling q1.to_pandas()
. A q1.to_polars()
equivalent has been added to the dev version of Ibis, but is not available with the latest offical release (8.0.0 at the time of writing).
Digression: Interactive Ibis use and eager execution
At the very top of this document, I commented out the ibis.options.interactive
option as part of my Ibis configuration. This was because I wanted to demonstrate the default deferred (i.e., lazy) behaviour of Ibis, which is just the same as d(b)plyr in R. If you are building data wrangling pipelines, or writing scripts with potentially complex queries, you probably want to preserve this deferred behaviour and avoid eager execution.
However, there are times when you may want to default into eager execution. For example, if your dataset is of manageable size, or you are trying to iterate through different query operations… Or, you might just want to enable it so that you automatically get a nice print return object for your workshop materials. I’ll adopt the latter view, so that I can quickly demonstrate some Ibis syntax and results for the rest of this document.
= True ibis.options.interactive
Okay, let’s speed through some of the same basic queries that we’ve already seen in the DuckDB SQL and R (dplyr) pages. I won’t bother to explain them in depth. Just consider them for demonstration purposes.
Aggregation
(
nyc"passenger_count", "trip_distance"])
.group_by([
.aggregate(= _.tip_amount.mean(),
mean_tip = _.fare_amount.mean()
mean_fare
) )
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ passenger_count ┃ trip_distance ┃ mean_tip ┃ mean_fare ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩ │ int64 │ float64 │ float64 │ float64 │ ├─────────────────┼───────────────┼──────────┼───────────┤ │ 1 │ 0.90 │ 0.578975 │ 5.687596 │ │ 2 │ 8.80 │ 2.669911 │ 24.652684 │ │ 1 │ 3.50 │ 1.384175 │ 12.883433 │ │ 1 │ 2.90 │ 1.223380 │ 11.352555 │ │ 1 │ 0.77 │ 0.499988 │ 5.271770 │ │ 1 │ 1.90 │ 0.911617 │ 8.669678 │ │ 1 │ 4.00 │ 1.516746 │ 14.098944 │ │ 2 │ 0.90 │ 0.495665 │ 5.764356 │ │ 1 │ 0.10 │ 0.526552 │ 6.079754 │ │ 2 │ 1.90 │ 0.783820 │ 8.720520 │ │ … │ … │ … │ … │ └─────────────────┴───────────────┴──────────┴───────────┘
Note that, even though we have enabled the interactive print mode, we still get lazy evalation if we assign a chain of query steps to an object (here: q3
)…
= (
q3
nyc"passenger_count", "trip_distance"])
.group_by([
.agg(= _.tip_amount.mean(),
mean_tip = _.fare_amount.mean()
mean_fare
) )
… but printing the query to screen enforces computation.
q3
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ passenger_count ┃ trip_distance ┃ mean_tip ┃ mean_fare ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━┩ │ int64 │ float64 │ float64 │ float64 │ ├─────────────────┼───────────────┼──────────┼───────────┤ │ 1 │ 2.0 │ 0.942336 │ 8.932499 │ │ 1 │ 10.8 │ 3.808778 │ 29.238556 │ │ 1 │ 3.0 │ 1.250686 │ 11.611495 │ │ 1 │ 1.2 │ 0.688208 │ 6.640748 │ │ 1 │ 1.3 │ 0.720728 │ 6.941542 │ │ 1 │ 3.6 │ 1.404022 │ 13.127911 │ │ 1 │ 4.1 │ 1.545880 │ 14.321923 │ │ 1 │ 1.4 │ 0.753561 │ 7.243111 │ │ 2 │ 3.3 │ 1.171171 │ 12.518436 │ │ 2 │ 2.0 │ 0.810508 │ 8.988705 │ │ … │ … │ … │ … │ └─────────────────┴───────────────┴──────────┴───────────┘
Pivot (reshape)
# now chain on pivoting (and enforce computation via printing)
(
q3"mean_tip":"mean_fare"])
.pivot_longer(s.r[ )
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━┓ ┃ passenger_count ┃ trip_distance ┃ name ┃ value ┃ ┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━┩ │ int64 │ float64 │ string │ float64 │ ├─────────────────┼───────────────┼───────────┼──────────┤ │ 2 │ 0.7 │ mean_tip │ 0.433840 │ │ 2 │ 0.7 │ mean_fare │ 5.117718 │ │ 1 │ 0.8 │ mean_tip │ 0.534714 │ │ 1 │ 0.8 │ mean_fare │ 5.360443 │ │ 1 │ 0.7 │ mean_tip │ 0.493234 │ │ 1 │ 0.7 │ mean_fare │ 5.035307 │ │ 2 │ 0.3 │ mean_tip │ 0.310850 │ │ 2 │ 0.3 │ mean_fare │ 4.192242 │ │ 3 │ 0.8 │ mean_tip │ 0.414165 │ │ 3 │ 0.8 │ mean_fare │ 5.470059 │ │ … │ … │ … │ … │ └─────────────────┴───────────────┴───────────┴──────────┘
Joins (merges)
(As we did in the dplyr code, we’ll break this contrived join example into two steps)
= nyc.group_by("month").agg(mean_tip = _.tip_amount.mean())
mean_tips = nyc.group_by("month").agg(mean_fare = _.fare_amount.mean()) mean_fares
(
mean_tips"month")
.left_join(mean_fares, )
┏━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓ ┃ month ┃ mean_tip ┃ month_right ┃ mean_fare ┃ ┡━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩ │ int64 │ float64 │ int64 │ float64 │ ├───────┼──────────┼─────────────┼───────────┤ │ 11 │ 1.250903 │ 11 │ 12.270138 │ │ 10 │ 1.281239 │ 10 │ 12.501252 │ │ 2 │ 1.036874 │ 2 │ 9.942640 │ │ 4 │ 1.043167 │ 4 │ 10.335490 │ │ 1 │ 1.007817 │ 1 │ 9.813488 │ │ 5 │ 1.078014 │ 5 │ 10.585157 │ │ 9 │ 1.254601 │ 9 │ 12.391198 │ │ 7 │ 1.059312 │ 7 │ 10.379943 │ │ 8 │ 1.079521 │ 8 │ 10.492650 │ │ 12 │ 1.237651 │ 12 │ 12.313953 │ │ … │ … │ … │ … │ └───────┴──────────┴─────────────┴───────────┘