Historical retrieval #3
Draft
What this PR does / why we need it:
The current SQL template for the point-in-time join does not scale.
Problem 1: ROW_NUMBER() does not scale
In order to calculate a unique ID for each row of the entity dataframe, we compute a `ROW_NUMBER()` over the entire entity dataframe.
The problem is that BigQuery needs to send all the data to a single worker in order to properly calculate the row number of each row. For our use case, we end up with an OOM error.
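Roughly, the pattern looks like the sketch below (table and column names are illustrative; this is not the exact Feast template):

```sql
-- Rough sketch of the current approach (illustrative names): a global
-- ROW_NUMBER() forces BigQuery to funnel every row of the entity dataframe
-- through a single worker in order to assign sequential IDs.
WITH entity_dataframe AS (
    SELECT
        *,
        ROW_NUMBER() OVER() AS entity_row_unique_id
    FROM `project.dataset.entity_df`
)
SELECT * FROM entity_dataframe
```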

Solution
The solution is to calculate a deterministic hash that will act as a unique identifier.
Because the entity dataframe should contain all entity keys, I use `FARM_FINGERPRINT()`, which computes a deterministic hash for a given input. This hash is computed in a distributed fashion, as it only needs the data points of a given row.
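As an illustration, assuming the entity dataframe has a `user_id` entity key and an `event_timestamp` column (placeholder names, not necessarily those of the actual template), the pattern would be:

```sql
-- Rough sketch of the proposed approach (illustrative names): the hash only
-- depends on the values of the current row, so BigQuery can compute it in a
-- fully distributed fashion.
WITH entity_dataframe AS (
    SELECT
        *,
        FARM_FINGERPRINT(CONCAT(
            CAST(user_id AS STRING),
            CAST(event_timestamp AS STRING)
        )) AS entity_row_unique_id
    FROM `project.dataset.entity_df`
)
SELECT * FROM entity_dataframe
```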
Alternative

I tried `GENERATE_UUID()`, which is non-deterministic, and the query returned wrong results; I suspect it was computed multiple times (depending on how the SQL query gets optimized and parsed), so all features always ended up being Null.

TODO: Matt can look at how the query is interpreted and see if `GENERATE_UUID()` is called multiple times.

Problem 2: Window functions and ORDER BY are often the bottleneck
As a former data scientist, I have often observed that big data engines (Spark, BigQuery, Presto, etc.) are much more efficient with a series of GROUP BYs than with a window function. Moreover, we should avoid ORDER BY operations as much as possible.
So this PR comes with a new approach to compute the point-in-time join, composed solely of JOINs and GROUP BYs.
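To make the idea concrete, here is a minimal sketch of that pattern for a single hypothetical feature view (`feature_view_A`, `user_id` and `feature_value` are placeholder names; this is not the actual template introduced by the PR, which also has to handle TTLs, created timestamps and multiple feature views):

```sql
-- Rough sketch of the JOIN + GROUP BY pattern (illustrative names).
WITH latest AS (
    -- For every entity row, find the most recent feature timestamp that is
    -- not newer than the entity timestamp: a plain GROUP BY, no window function.
    SELECT
        e.entity_row_unique_id,
        e.user_id,
        MAX(f.event_timestamp) AS feature_timestamp
    FROM entity_dataframe AS e
    JOIN feature_view_A AS f
        ON f.user_id = e.user_id
       AND f.event_timestamp <= e.event_timestamp
    GROUP BY e.entity_row_unique_id, e.user_id
)
-- Join back on the selected timestamp to retrieve the feature values.
SELECT
    l.entity_row_unique_id,
    f.feature_value
FROM latest AS l
JOIN feature_view_A AS f
    ON f.user_id = l.user_id
   AND f.event_timestamp = l.feature_timestamp
```

If a feature view can contain several rows with the same entity key and timestamp, an extra deduplicating GROUP BY is needed; the sketch omits it.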
Here are the results of my benchmark:
Context
I perform the same query using both templates. For the original template, I swap the `ROW_NUMBER()` of the entity dataframe for the `FARM_FINGERPRINT()` one, as explained above.

The API call is the following:
And to give an idea of the scale of this historical retrieval:
- `feature_view_A` contains ~5B rows and ~3.6B unique "user_id"
- `feature_view_B` contains ~5B rows and ~3.6B unique "user_id"
- `feature_view_C` contains ~1.7B rows and ~1.1B unique "user_id"
- `feature_view_D` contains ~42B rows and ~3.5B unique "user_id"

Results
On the original SQL template
With the SQL template of this PR
So as we can see, the proposed SQL template consumes half the resources of the one currently implemented.
Also, because this new SQL template is composed only of JOINs and GROUP BYs, it should scale "indefinitely", except if the data is highly skewed (e.g. a single "user_id" representing 20% of the dataset).
Which issue(s) this PR fixes:
Fixes #
Does this PR introduce a user-facing change?: