Filter out of order feature rows #273
Conversation
Hi @khorshuheng. Thanks for your PR. I'm waiting for a gojek member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Hi @khorshuheng, thanks so much for the submission!
Can we drill into this functionality a bit: what would happen if the user was streaming in data which was ordered, then stopped streaming and waited until the state expired, and then started streaming in very old data? Would this be allowed through? Is that intended?
@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?
Yes, this will be allowed through, and it is intended. If we require out of order feature rows to be handled over a larger time window, such as in the order of days, then the approach in this current PR would not be feasible. Instead, we would have to store the state externally, such as in Redis, though that would come at the price of increased latency. An alternative I can think of, without sacrificing latency, would be to define a maximum allowable out-of-date duration: e.g. features which are 1 hour older than the ingestion time would be rejected, regardless of ingestion order. Would that fulfil the typical use cases of Feast?
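The "maximum allowable out-of-date duration" alternative could be sketched like this (plain Python with illustrative names; not code from the Feast codebase):

```python
from datetime import datetime, timedelta

# Hypothetical sketch: reject any row whose event time lags the ingestion time
# by more than max_age, regardless of arrival order. No per-key state is needed.
def is_too_old(event_time: datetime, ingestion_time: datetime,
               max_age: timedelta = timedelta(hours=1)) -> bool:
    return ingestion_time - event_time > max_age

now = datetime(2020, 1, 1, 12, 0, 0)
print(is_too_old(now - timedelta(minutes=30), now))  # False: within the window
print(is_too_old(now - timedelta(hours=2), now))     # True: rejected
```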
It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and the current event timestamp will be written to the state store.
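As a rough illustration of that null-state behaviour (a plain-Python sketch, not the actual Beam stateful DoFn):

```python
# Illustrative only: a dict stands in for Beam's per-key @StateId value.
state = {}  # key -> last-seen event timestamp

def accept_row(key, event_ts):
    last_seen = state.get(key)          # None on first sight of the key
    if last_seen is None or event_ts > last_seen:
        state[key] = event_ts           # seed or update the state
        return True                     # row flows through to the serving store
    return False                        # out-of-order row is filtered out

print(accept_row("driver:1", 100))  # True  (no state yet, treated as up to date)
print(accept_row("driver:1", 90))   # False (older than last seen)
print(accept_row("driver:1", 110))  # True
```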
Does this mean that when a job is restarted we will potentially allow older keys to overwrite newer keys?
Ok, I see where you are coming from, but I still need to clarify how this will map to our use cases.

We must be able to support ingestion of data that has an event time of any age. For example, users could ingest feature rows with timestamps from 1970, and we should support that. Although in the cases where we go back far in history (months), we typically do not have very granular data, so the key space should not be too big. However, it could still be bigger than a single worker node can handle, especially if that node is handling ingestion for multiple distinct ingestion jobs.

It seems like the expiration of keys mostly serves the purpose of allowing overwriting or "fixing" of bad data that has been ingested. Even though supporting that would be great, I would rather give up that functionality to ensure that we don't clobber new values with old values accidentally.

Storing state in Redis makes sense, but I would love to be able to get around having that as a dependency for this functionality.
Good catch, I will need to check the Apache Beam documentation to see whether the state store value persists between job restarts. Most likely that is not the case. How about if we handle this at the Redis write level? Something along the lines of: https://stackoverflow.com/questions/22906778/conditional-redis-set-only-update-with-newest-version I have not researched this much yet; I will explore a bit more to see if it is viable.
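The idea behind that linked answer, simulated in plain Python (against real Redis the check-and-set would have to run atomically, e.g. as a Lua script via EVAL; the store layout below is purely illustrative):

```python
# In-memory simulation of a "set only if newer" write.
store = {}  # key -> (version_or_timestamp, value)

def set_if_newer(key, version, value):
    current = store.get(key)
    if current is None or version >= current[0]:
        store[key] = (version, value)
        return True   # write accepted
    return False      # stale write rejected

print(set_if_newer("feature:driver:1", 5, "a"))  # True
print(set_if_newer("feature:driver:1", 3, "b"))  # False: older version
print(store["feature:driver:1"])                 # (5, 'a')
```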
I think we would want to ensure that this apply to all serving stores that we implement, so I would prefer to do this at the stream level. We would also have more metrics that we could collect from these events if we capture it in the stream. |
/ok-to-test |
Should we make this >= so that it is possible to overwrite values with the same event_time?
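For illustration, the difference between the two comparisons (a hypothetical helper, not the PR's actual code):

```python
def accept(event_ts, last_seen, allow_equal=False):
    # allow_equal=True corresponds to using >= in the filter, letting a row
    # with the same event_time overwrite the previously written value.
    if allow_equal:
        return event_ts >= last_seen
    return event_ts > last_seen

print(accept(100, 100))                    # False with strict >
print(accept(100, 100, allow_equal=True))  # True with >=
```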
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: khorshuheng. The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
@khorshuheng just picking this up again because I think it's important to get this merged in.
Would it be possible to update this PR to have:
In this implementation, if we set the timer expiry duration to 3 days, and 5 days later we ingest old data:
The values in the serving store will be replaced with the old ones, right? Because by the time this ingestion (which happens 5 days later) runs, the per-key state has already expired. Just want to make sure that this is OK and expected.
Yes, that is correct. The trade-off is that it now allows us to ingest data to replace data in the serving store if we ingested the wrong data at a future timestamp. If we allow for this configuration at the feature set or subscription level, then it can be configured to "infinite" for some feature sets if needed. Another idea is to have another flag at ingestion like
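A toy simulation of that 3-day-expiry / 5-days-later scenario (hours as integers; numbers and names are illustrative only):

```python
EXPIRY = 72  # state retained for 3 days (in hours)
state = {}   # key -> (last_event_ts, state_written_at)

def ingest(key, event_ts, now):
    entry = state.get(key)
    if entry is not None and now - entry[1] > EXPIRY:
        entry = None                 # the expiry timer has cleared the state
    if entry is None or event_ts > entry[0]:
        state[key] = (event_ts, now)
        return True                  # row reaches the serving store
    return False                     # filtered as out of order

print(ingest("k", event_ts=1000, now=1000))  # True: fresh ingest
print(ingest("k", event_ts=500,  now=1001))  # False: state still live, old row blocked
print(ingest("k", event_ts=500,  now=1200))  # True: state expired, old row passes
```

This is the trade-off discussed above: once the per-key state expires, an old row can overwrite a newer value in the serving store.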
Force-pushed 499d792 to 2f00898, then 2f00898 to 7f4e382.
@khorshuheng: The following tests failed. Say `/retest` to rerun all failed tests.

Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
We tested the current approach with a dataset of 600k unique keys (hence none of the rows will get filtered away for this particular dataset), using both the Direct Runner and the Dataflow Runner. In both cases, the inclusion of the state store did not result in a significant memory increase. However, latency increases as the number of keys in the state store grows: in our use case, Redis ingestion throughput dropped by more than half.
@khorshuheng should we perhaps close this PR for the time being?
This is still a work in progress.
The purpose of this PR is to ensure that out of order feature rows do not overwrite existing, more recent values stored in Redis.
To achieve this, this PR defines a new transformation which filters out of order feature rows based on event timestamp. The events are first grouped by key (similar to the Redis key), then the most recent event timestamp per key is stored and updated using StateId.
As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, a Feast user can fix bad data by re-ingesting the same feature row with the same timestamp, as long as it is done after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key would not be stuck forever.
This approach is also extensible to non-Redis IO pipelines.
Caveat:
Alternative considered:
We would need to discuss more whether the current approach is the most suitable one. Once confirmed, I can continue integrating this into the current Feast ingestion workflow.