
I have a BigQuery table into which a Pub/Sub subscription inserts new web events every second. This table is partitioned by:

  • column: derived_tstamp
  • type: timestamp
  • granularity: daily

To create a specific model from this data, I need to build an incremental model that only inserts new events into a staging table and uses partition pruning when scanning for the last event timestamp, to keep costs down.

The easy option of using a subquery does not work, because BigQuery partition pruning does not support dynamic table values.

https://cloud.google.com/bigquery/docs/querying-partitioned-tables#better_performance_with_pseudocolumns:

However, the second filter condition doesn't limit the scanned partitions, because it uses table values, which are dynamic.

SELECT
  column
FROM
  dataset.table2
WHERE
  -- This filter condition limits the scanned partitions:
  _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-03-01')
  -- This one doesn't, because it uses dynamic table values:
  AND _PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)

So this code still scans the whole table:

{{
config(
    materialized="incremental",
    incremental_strategy="merge",
    on_schema_change="append_new_columns",
    partition_by={
        "field": "derived_tstamp",
        "data_type": "timestamp",
        "granularity": "day",
    },
)
}}


with
    events as (
        select
            derived_tstamp
        from
            {{ ref("table_A") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
        {% if is_incremental() %}
            and derived_tstamp >= (select max(derived_tstamp) from {{ this }})
        {% endif %}
    )


select * from events

I could declare and set a variable via the sql_header macro, but this throws an error on --full-refresh because the table does not exist yet when the header runs:

{% call set_sql_header(config) %}
    declare max_derived_tstamp timestamp;
    set max_derived_tstamp = (select max(derived_tstamp) from {{ this }});  -- crashes on --full-refresh
{% endcall %}
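(A sketch of one way to stop the header from crashing: only emit the lookup when the model runs incrementally, and fall back to a constant floor value otherwise. This is untested here; the fallback date 2000-01-01 is an arbitrary placeholder, not something from my actual setup.)

{% call set_sql_header(config) %}
    declare max_derived_tstamp timestamp;
    {% if is_incremental() %}
        -- the target table exists on incremental runs, so the lookup is safe
        set max_derived_tstamp = (select max(derived_tstamp) from {{ this }});
    {% else %}
        -- full refresh / first run: the table may not exist yet
        set max_derived_tstamp = timestamp('2000-01-01');
    {% endif %}
{% endcall %}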

{{
config(
    materialized="incremental",
    incremental_strategy="merge",
    on_schema_change="append_new_columns",
    partition_by={
        "field": "derived_tstamp",
        "data_type": "timestamp",
        "granularity": "day",
    },
)
}}


with
    events as (
        select
            derived_tstamp
        from
            {{ ref("web_events") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
        {% if is_incremental() %}
            and derived_tstamp >= max_derived_tstamp
        {% endif %}
    )


select * from events

How can I create an incremental model that does perform partition pruning in BigQuery?

1 Answer

A couple of ideas:

  1. Switch your incremental strategy to insert_overwrite (details here) and use the last N days in your incremental filter.
{% set partitions_to_replace = [
  'timestamp(current_date)',
  'timestamp(date_sub(current_date, interval 1 day))'
] %}

{{
config(
    materialized="incremental",
    incremental_strategy="insert_overwrite",
    on_schema_change="append_new_columns",
    partition_by={
        "field": "derived_tstamp",
        "data_type": "timestamp",
        "granularity": "day",
    },
    partitions = partitions_to_replace
)
}}


with
    events as (
        select
            derived_tstamp
        from
            {{ ref("table_A") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
        {% if is_incremental() %}
            and timestamp_trunc(derived_tstamp, day) in ({{ partitions_to_replace | join(',') }})
        {% endif %}
    )


select * from events
  2. Add an is_incremental() condition to your second solution:
{% set max_partition_query %}
    select max(derived_tstamp) from {{ this }}
{% endset %}

{% set results = run_query(max_partition_query) %}
{% if execute and is_incremental() %}
{% set max_derived_tstamp = results.columns[0].values()[0] %}
{% else %}
{% set max_derived_tstamp = '2000-01-01 00:00:00' %}
{% endif %}


{{
config(
    materialized="incremental",
    incremental_strategy="merge",
    on_schema_change="append_new_columns",
    partition_by={
        "field": "derived_tstamp",
        "data_type": "timestamp",
        "granularity": "day",
    },
)
}}


with
    events as (
        select
            derived_tstamp
        from
            {{ ref("web_events") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
        {% if is_incremental() %}
            and derived_tstamp >= timestamp('{{ max_derived_tstamp }}')
        {% endif %}
    )


select * from events
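(For completeness, a sketch of a variant of the lookup above that moves run_query inside the is_incremental() guard, so a --full-refresh never queries {{ this }} at all. Untested; the fallback timestamp is an arbitrary placeholder.)

{% if execute and is_incremental() %}
    {% set max_partition_query %}
        select max(derived_tstamp) from {{ this }}
    {% endset %}
    {% set results = run_query(max_partition_query) %}
    {% set max_derived_tstamp = results.columns[0].values()[0] %}
{% else %}
    {% set max_derived_tstamp = '2000-01-01 00:00:00' %}
{% endif %}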

4 Comments

your proposed solution #1 does not insert new events but all events from the last 7 days = duplicates = not usable. #2 does not work because you can't use set in a CTA, which dbt creates around the model.
And how do you plan to reference {{ this }} on --full-refresh? The table does not exist in this case and the model fails; your code is not idempotent and does not work.
Hi Vega. Regarding #1: you specified incremental_strategy = merge, which merges rows by unique_key, so it will not generate duplicates. You must specify unique_key, though. Anyway, I've updated the #1 approach. There is an insert_overwrite incremental strategy, which I believe will solve your problem. It is similar to my initially proposed solution (you specify the last N days you want to update on each incremental run).
Regarding #2: I'm not sure I understood your statement that you can't use set in a CTA. In particular, could you clarify what you mean by CTA? More generally, I tested similar code on my BigQuery instance, and it worked as I expected. As for your next statement: is it your gut feeling that it won't work, or did you run it and get an error? Answering your question directly: I plan to reference {{ this }} only on the incremental run, because I've added an is_incremental() condition, which you may not have noticed. Let me know if I missed something or if you have specific errors.
