Historical retrieval #3
Draft
What this PR does / why we need it:
The current SQL template for the point-in-time join does not scale.
Problem 1: ROW_NUMBER() does not scale
In order to calculate a unique ID for each row of the entity dataframe, we compute a `ROW_NUMBER()` over the entire entity dataframe.
The problem is that BigQuery needs to send all the data to a single worker in order to properly calculate the row number of each row. For our use case, we end up with an OOM error.
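Roughly, the pattern looks like the sketch below (table and column names are illustrative; this is not the exact Feast template):

```sql
-- Rough sketch of the current approach (illustrative names): a global
-- ROW_NUMBER() forces BigQuery to funnel every row of the entity dataframe
-- through a single worker in order to assign sequential IDs.
WITH entity_dataframe AS (
    SELECT
        *,
        ROW_NUMBER() OVER() AS entity_row_unique_id
    FROM `project.dataset.entity_df`
)
SELECT * FROM entity_dataframe
```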

Solution
The solution is to calculate a deterministic hash that will act as a unique identifier.
Because the entity dataframe should contain all entity keys, I use `FARM_FINGERPRINT()`, which computes a deterministic hash for a given input. This hash is computed in a distributed fashion, as it only needs the data points of a given row.
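As an illustration, assuming the entity dataframe has a `user_id` entity key and an `event_timestamp` column (placeholder names, not necessarily those of the actual template), the pattern would be:

```sql
-- Rough sketch of the proposed approach (illustrative names): the hash only
-- depends on the values of the current row, so BigQuery can compute it in a
-- fully distributed fashion.
WITH entity_dataframe AS (
    SELECT
        *,
        FARM_FINGERPRINT(CONCAT(
            CAST(user_id AS STRING),
            CAST(event_timestamp AS STRING)
        )) AS entity_row_unique_id
    FROM `project.dataset.entity_df`
)
SELECT * FROM entity_dataframe
```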
Alternative

I tried `GENERATE_UUID()`, which is non-deterministic, and the query returned wrong results; I suspect it was computed multiple times (depending on how the SQL query gets optimized and parsed), so all features always ended up being Null.

TODO: Matt can look at how the query is interpreted and see if `GENERATE_UUID()` is called multiple times.

Problem 2: Window functions and ORDER BY are often the bottleneck
As a former data scientist, I have often observed that big data engines (Spark, BigQuery, Presto, etc.) are much more efficient with a series of GROUP BYs than with a window function. Moreover, we should avoid ORDER BY operations as much as possible.
So this PR comes with a new approach to compute the point-in-time join, composed solely of JOINs and GROUP BYs.
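To make the idea concrete, here is a minimal sketch of that pattern for a single hypothetical feature view (`feature_view_A`, `user_id` and `feature_value` are placeholder names; this is not the actual template introduced by the PR, which also has to handle TTLs, created timestamps and multiple feature views):

```sql
-- Rough sketch of the JOIN + GROUP BY pattern (illustrative names).
WITH latest AS (
    -- For every entity row, find the most recent feature timestamp that is
    -- not newer than the entity timestamp: a plain GROUP BY, no window function.
    SELECT
        e.entity_row_unique_id,
        e.user_id,
        MAX(f.event_timestamp) AS feature_timestamp
    FROM entity_dataframe AS e
    JOIN feature_view_A AS f
        ON f.user_id = e.user_id
       AND f.event_timestamp <= e.event_timestamp
    GROUP BY e.entity_row_unique_id, e.user_id
)
-- Join back on the selected timestamp to retrieve the feature values.
SELECT
    l.entity_row_unique_id,
    f.feature_value
FROM latest AS l
JOIN feature_view_A AS f
    ON f.user_id = l.user_id
   AND f.event_timestamp = l.feature_timestamp
```

If a feature view can contain several rows with the same entity key and timestamp, an extra deduplicating GROUP BY is needed; the sketch omits it.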
Here are the results of my benchmark:
Context
I perform the same query using both templates. For the original template, I swap the `ROW_NUMBER()` of the entity dataframe for the `FARM_FINGERPRINT()` one, as explained above.

The API call is the following:
And to give an idea of the scale of this historical retrieval:
- `feature_view_A` contains ~5B rows and ~3.6B unique "user_id"
- `feature_view_B` contains ~5B rows and ~3.6B unique "user_id"
- `feature_view_C` contains ~1.7B rows and ~1.1B unique "user_id"
- `feature_view_D` contains ~42B rows and ~3.5B unique "user_id"

Results
On the original SQL template
With the SQL template of this PR
So as we can see, the proposed SQL template consumes half the resources of the one currently implemented.
Also, because this new SQL template is composed only of JOINs and GROUP BYs, it should scale "indefinitely", except if the data is highly skewed (e.g. a single "user_id" representing 20% of the dataset).
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: