@MattDelac MattDelac commented May 29, 2021

What this PR does / why we need it:
The current SQL template for the point-in-time join does not scale.

Problem 1: ROW_NUMBER() does not scale

In order to calculate a unique ID for each row of the entity dataframe, we compute the following:

SELECT ROW_NUMBER() OVER() AS row_number, edf.* FROM {{ left_table_query_string }} as edf

The problem is that, because the OVER() clause has no PARTITION BY, BigQuery needs to send all the data to a single worker in order to properly calculate the row number of each row. For our use case, we end up with an OOM error.

Solution
The solution is to calculate a deterministic hash that acts as a unique identifier.
Because the entity dataframe should contain all entity keys, I use FARM_FINGERPRINT(), which computes a deterministic hash for a given input. This hash can be computed in a distributed fashion because it only needs the values of a given row.

WITH entity_df AS (
    SELECT
        *,
        -- Deterministic row ID: hash of every entity key plus the event timestamp
        FARM_FINGERPRINT(CONCAT(
            {% for entity_key in unique_entity_keys %}
                CAST({{entity_key}} AS STRING),
            {% endfor %}
            CAST({{entity_df_event_timestamp_col}} AS STRING)
        )) AS entity_row_unique_id
    FROM {{ left_table_query_string }}
),
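For illustration, with the single entity key user_id and timestamp column observed_at used in the benchmark below, the template renders to roughly the following (the source table name is a placeholder, not from the PR):

-- Sketch of the rendered template for the benchmark's entity dataframe
-- (single entity key `user_id`, timestamp column `observed_at`).
WITH entity_df AS (
    SELECT
        *,
        FARM_FINGERPRINT(CONCAT(
            CAST(user_id AS STRING),
            CAST(observed_at AS STRING)
        )) AS entity_row_unique_id
    FROM my_dataset.entity_df  -- hypothetical: whatever left_table_query_string expands to
)
SELECT * FROM entity_df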

Alternative
I tried GENERATE_UUID(), but it is non-deterministic and the query produced wrong results. I suspect it got computed multiple times (depending on how the SQL query gets optimized and parsed), so we ended up with all features always being NULL.

TODO: Matt can look at how the query is interpreted and see if GENERATE_UUID() is called multiple times
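A minimal sketch of the suspected failure mode (my assumption, not verified in the PR): if BigQuery inlines the CTE rather than materializing it, each reference to it can re-evaluate GENERATE_UUID(), so the join keys never match:

-- Hypothetical repro: if the CTE is inlined, each reference to entity_df
-- may re-evaluate GENERATE_UUID(), so the self-join matches zero rows.
WITH entity_df AS (
    SELECT user_id, GENERATE_UUID() AS entity_row_unique_id
    FROM UNNEST([1, 2, 3]) AS user_id
)
SELECT a.user_id
FROM entity_df AS a
JOIN entity_df AS b USING (entity_row_unique_id)  -- can return 0 rows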

Problem 2: Window functions and ORDER BY are often the bottleneck

As a former data scientist, I have often observed that big data engines (Spark, BigQuery, Presto, etc.) are much more efficient with a series of GROUP BYs than with a window function. Moreover, we should avoid ORDER BY operations as much as possible.

So this PR comes with a new approach to computing the point-in-time join. It is composed solely of JOINs and GROUP BYs, as sketched below.
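To make the contrast concrete, here is a sketch (not the actual PR template; table and column names are assumptions) of replacing a "latest row per key" window function with a GROUP BY plus a join:

-- Window-function pattern: partitions and sorts every row.
SELECT * EXCEPT (rn)
FROM (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_timestamp DESC) AS rn
    FROM my_dataset.features
)
WHERE rn = 1;

-- Equivalent GROUP BY + JOIN pattern: both steps distribute by user_id.
-- Note: ties on (user_id, event_timestamp) would yield duplicate rows,
-- so a real template needs an extra deduplication step.
WITH latest AS (
    SELECT user_id, MAX(event_timestamp) AS event_timestamp
    FROM my_dataset.features
    GROUP BY user_id
)
SELECT f.*
FROM my_dataset.features AS f
JOIN latest USING (user_id, event_timestamp);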

Here are the results of my benchmark:
Context
I ran the same query using both templates. For the original template, I swapped the ROW_NUMBER() of the entity dataframe for the FARM_FINGERPRINT() version, as explained above.

The API call is the following:

result = store.get_historical_features(
    entity_df="""
    SELECT user_id, TIMESTAMP '2021-01-01' AS observed_at
    FROM `my_dataset`
    LIMIT 100000000 -- 100M
    """,
    feature_refs=[
        "feature_view_A:feature_1",
        "feature_view_A:feature_2",
        "feature_view_B:feature_3",
        "feature_view_B:feature_4",
        "feature_view_B:feature_5",
        "feature_view_C:feature_6",
        "feature_view_D:feature_7",
    ]
)

And to give some idea of the scale of this historical retrieval:

  • feature_view_A contains ~5B rows and ~3.6B unique "user_id"
  • feature_view_B contains ~5B rows and ~3.6B unique "user_id"
  • feature_view_C contains ~1.7B rows and ~1.1B unique "user_id"
  • feature_view_D contains ~42B rows and ~3.5B unique "user_id"

Results
On the original SQL template

  • Elapsed time: 5 min 59 sec
  • Slot time consumed: 5 days 5 hr
  • Bytes shuffled: 5.89 TB
  • Bytes spilled to disk: 0 B

With the SQL template of this PR

  • Elapsed time: 3 min 21 sec (-44%)
  • Slot time consumed: 2 days 12 hr (-52%)
  • Bytes shuffled: 3.55 TB (-40%)
  • Bytes spilled to disk: 0 B

So as we can see, the proposed SQL template consumes about half the resources of the one currently implemented.

Also, because this new SQL template is composed only of JOINs and GROUP BYs, it should scale "indefinitely", except when the data is highly skewed (e.g. a single "user_id" representing 20% of the dataset).
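As a quick sanity check for that caveat, one could measure key skew with a simple aggregation before running the retrieval (a sketch; the table name is an assumption):

-- Share of the dataset held by the heaviest keys (hypothetical table name).
WITH counts AS (
    SELECT user_id, COUNT(*) AS n
    FROM my_dataset.features
    GROUP BY user_id
)
SELECT user_id, n, n / (SELECT SUM(n) FROM counts) AS share_of_rows
FROM counts
ORDER BY n DESC
LIMIT 10;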

Which issue(s) this PR fixes:

Fixes #

Does this PR introduce a user-facing change?:


@MattDelac MattDelac force-pushed the historical_retrieval branch from 2231ece to 74b11d1 on May 31, 2021
asad-a commented May 31, 2021

The description is clear and easy for me to understand. The impact of the solution looks great, @MattDelac!
