I have a BigQuery table where a PubSub subscription inserts new web events every second. This table is partitioned by:
- column: derived_tstamp
- type: timestamp
- granularity: daily
To create a specific model from this data, I need to build an incremental model that inserts only new events into a staging table and uses partition pruning when scanning for the last event timestamp, to save costs.
The easy option of using a subselect does not work because BigQuery partition pruning does not support dynamic table values.
In the following example, the first filter condition limits the scanned partitions, but the second one doesn't, because it uses table values, which are dynamic:
SELECT
  column
FROM
  dataset.table2
WHERE
  -- This filter condition limits the scanned partitions:
  _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-03-01')
  -- This one doesn't, because it uses dynamic table values:
  AND _PARTITIONTIME = (SELECT MAX(timestamp) FROM dataset.table1)
So this code still scans the whole table:
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        on_schema_change="append_new_columns",
        partition_by={
            "field": "derived_tstamp",
            "data_type": "timestamp",
            "granularity": "day",
        },
    )
}}
with
    events as (
        select
            derived_tstamp,
        from
            {{ ref("table_A") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
            {% if is_incremental() %}
            -- dynamic subselect: this filter does not prune partitions
            and derived_tstamp >= (select max(derived_tstamp) from {{ this }})
            {% endif %}
    )
select * from events
I could declare and set a variable via the SQL header macro, but this throws an error when doing a --full-refresh because the table does not exist yet when running:
{% call set_sql_header(config) %}
    declare max_derived_tstamp timestamp;
    -- CRASHES on --full-refresh: the table does not exist yet
    set max_derived_tstamp = (select max(derived_tstamp) from {{ this }});
{% endcall %}
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        on_schema_change="append_new_columns",
        partition_by={
            "field": "derived_tstamp",
            "data_type": "timestamp",
            "granularity": "day",
        },
    )
}}
with
    events as (
        select
            derived_tstamp,
        from
            {{ ref("web_events") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
            {% if is_incremental() %}
            -- scripting variable declared in the SQL header above
            and derived_tstamp >= max_derived_tstamp
            {% endif %}
    )
select * from events
How can I create an incremental model that does perform partition pruning in BigQuery?
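For reference, here is a minimal, unverified sketch of one possible direction. It fetches the maximum timestamp in a separate query with dbt's `run_query` and injects the result into the model as a literal, so the filter no longer depends on dynamic table values. The variable names `max_ts` and `result` are hypothetical. The sketch assumes that the value returned by `run_query` renders as a string that BigQuery's `timestamp()` can parse. The lookup only runs when `is_incremental()` is true, which is meant to avoid touching `{{ this }}` on a --full-refresh.

```sql
{{
    config(
        materialized="incremental",
        incremental_strategy="merge",
        on_schema_change="append_new_columns",
        partition_by={
            "field": "derived_tstamp",
            "data_type": "timestamp",
            "granularity": "day",
        },
    )
}}

{# Hypothetical helper variable: stays none on a full refresh or first run. #}
{% set max_ts = none %}
{% if is_incremental() and execute %}
    {# Separate lookup query against the existing target table. #}
    {% set result = run_query("select max(derived_tstamp) from " ~ this) %}
    {% set max_ts = result.columns[0].values()[0] %}
{% endif %}

with
    events as (
        select
            derived_tstamp,
        from
            {{ ref("web_events") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
            {% if max_ts is not none %}
            -- Assumption: max_ts renders as a timestamp string BigQuery accepts.
            and derived_tstamp >= timestamp("{{ max_ts }}")
            {% endif %}
    )
select * from events
```

The lookup query itself still reads the derived_tstamp column of the staging table, so whether this actually saves cost would need checking, for example by comparing bytes processed before and after.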