
I have a BigQuery table into which a Pub/Sub subscription inserts new web events every second. This table is partitioned by:

  • column: derived_tstamp
  • type: timestamp
  • granularity: daily

To create a specific model from this data, I need to build an incremental model that only inserts new events into a staging table and uses partition pruning when scanning for the last event timestamp, to keep costs down.

The easy option of using a subquery does not work, because BigQuery partition pruning does not support dynamic table values.

https://cloud.google.com/bigquery/docs/querying-partitioned-tables#better_performance_with_pseudocolumns:

However, the second filter condition doesn't limit the scanned partitions, because it uses table values, which are dynamic.

SELECT
  column
FROM
  dataset.table2
WHERE
  -- This filter condition limits the scanned partitions:
  _PARTITIONTIME BETWEEN TIMESTAMP('2017-01-01') AND TIMESTAMP('2017-03-01')
  -- This one doesn't, because it uses dynamic table values:
  AND _PARTITIONTIME = (SELECT MAX(timestamp) from dataset.table1)

So this code still scans the whole table:

{{
config(
    materialized="incremental",
    incremental_strategy="merge",
    on_schema_change="append_new_columns",
    partition_by={
        "field": "derived_tstamp",
        "data_type": "timestamp",
        "granularity": "day",
    },
)
}}


with
    events as (
        select
            derived_tstamp
        from
            {{ ref("table_A") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
        {% if is_incremental() %}
            and derived_tstamp >= (select max(derived_tstamp) from {{ this }})
        {% endif %}
    )


select * from events

I could declare and set a variable via the sql_header macro, but this throws an error on --full-refresh because the table does not exist yet when the header runs:

{% call set_sql_header(config) %}
    declare max_derived_tstamp timestamp;
    set max_derived_tstamp = (select max(derived_tstamp) from {{ this }});  -- crashes on --full-refresh
{% endcall %}
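(A sketch of one way to stop the header from crashing: only emit the lookup when the model runs incrementally, and fall back to a constant floor value otherwise. This is untested here; the fallback date 2000-01-01 is an arbitrary placeholder, not something from my actual setup.)

{% call set_sql_header(config) %}
    declare max_derived_tstamp timestamp;
    {% if is_incremental() %}
        -- the target table exists on incremental runs, so the lookup is safe
        set max_derived_tstamp = (select max(derived_tstamp) from {{ this }});
    {% else %}
        -- full refresh / first run: the table may not exist yet
        set max_derived_tstamp = timestamp('2000-01-01');
    {% endif %}
{% endcall %}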

{{
config(
    materialized="incremental",
    incremental_strategy="merge",
    on_schema_change="append_new_columns",
    partition_by={
        "field": "derived_tstamp",
        "data_type": "timestamp",
        "granularity": "day",
    },
)
}}


with
    events as (
        select
            derived_tstamp
        from
            {{ ref("web_events") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
        {% if is_incremental() %}
            and derived_tstamp >= max_derived_tstamp
        {% endif %}
    )


select * from events

How can I create an incremental model that does perform partition pruning in BigQuery?

1 Answer

A couple of ideas:

  1. Switch your incremental strategy to insert_overwrite (details here) and use the last N days in your incremental filter.
{% set partitions_to_replace = [
  'timestamp(current_date)',
  'timestamp(date_sub(current_date, interval 1 day))'
] %}

{{
config(
    materialized="incremental",
    incremental_strategy="insert_overwrite",
    on_schema_change="append_new_columns",
    partition_by={
        "field": "derived_tstamp",
        "data_type": "timestamp",
        "granularity": "day",
    },
    partitions = partitions_to_replace
)
}}


with
    events as (
        select
            derived_tstamp
        from
            {{ ref("table_A") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
        {% if is_incremental() %}
            and timestamp_trunc(derived_tstamp, day) in ({{ partitions_to_replace | join(',') }})
        {% endif %}
    )


select * from events
  2. Add an is_incremental() condition to your second solution:
{% set max_partition_query %}
    select max(derived_tstamp) from {{ this }}
{% endset %}

{% set results = run_query(max_partition_query) %}
{% if execute and is_incremental() %}
{% set max_derived_tstamp = results.columns[0].values()[0] %}
{% else %}
{% set max_derived_tstamp = '2000-01-01 00:00:00' %}
{% endif %}


{{
config(
    materialized="incremental",
    incremental_strategy="merge",
    on_schema_change="append_new_columns",
    partition_by={
        "field": "derived_tstamp",
        "data_type": "timestamp",
        "granularity": "day",
    },
)
}}


with
    events as (
        select
            derived_tstamp
        from
            {{ ref("web_events") }}
        where
            timestamp_trunc(derived_tstamp, day) >= timestamp("2024-01-01")
        {% if is_incremental() %}
            and derived_tstamp >= timestamp('{{ max_derived_tstamp }}')
        {% endif %}
    )


select * from events
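(For completeness, a sketch of a variant of the lookup above that moves run_query inside the is_incremental() guard, so a --full-refresh never queries {{ this }} at all. Untested; the fallback timestamp is an arbitrary placeholder.)

{% if execute and is_incremental() %}
    {% set max_partition_query %}
        select max(derived_tstamp) from {{ this }}
    {% endset %}
    {% set results = run_query(max_partition_query) %}
    {% set max_derived_tstamp = results.columns[0].values()[0] %}
{% else %}
    {% set max_derived_tstamp = '2000-01-01 00:00:00' %}
{% endif %}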

4 Comments

your proposed solution #1 does not insert new events but all events from the last 7 days = duplicates = not usable. #2 does not work because you can't use set in a CTA, which dbt creates around the model.
And how do you plan to reference {{ this }} on --full-refresh? The table does not exist in this case and the model fails; your code is not idempotent and does not work.
Hi Vega. Regarding #1: you specified incremental_strategy = merge, which merges rows by unique_key, so it will not generate duplicates. You must specify unique_key, though. Anyway, I've updated the #1 approach. There is an insert_overwrite incremental strategy, which I believe will solve your problem. It is similar to my initially proposed solution (you specify the last N days you want to update on each incremental run).
Regarding #2: I'm not sure I understood your statement that you can't use set in a CTA. In particular, could you clarify what you mean by CTA? More generally, I tested similar code on my BigQuery instance, and it worked as I expected. As for your next statement: is it your gut feeling that it won't work, or did you run it and get an error? Answering your question directly: I plan to reference {{ this }} only on the incremental run, because I've added an is_incremental() condition, which you may not have noticed. Let me know if I missed something or if you have specific errors.
