Filter out of order feature rows#273

Closed
khorshuheng wants to merge 1 commit intofeast-dev:masterfrom
khorshuheng:filter-out-of-order

Conversation

@khorshuheng
Collaborator

This is still a work in progress.

The purpose of this PR is to ensure that out-of-order feature rows do not overwrite the existing, more recent values stored in Redis.

To achieve this, this PR defines a new transformation which filters out out-of-order feature rows based on their event timestamp. The events are first grouped by key (similar to the Redis key), and the most recent event timestamp per key is stored and updated using StateId.

As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, Feast users can fix bad data by re-ingesting the same feature row with the same timestamp, as long as they do so after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key will not be stuck forever.

This approach is also extensible to non-Redis IO pipelines.
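The actual transform is built on Apache Beam's state and timer APIs in Java, but the core filtering logic can be sketched in plain Python. The class name `OrderedFilter` and its parameters are illustrative, not from this PR:

```python
import time

class OrderedFilter:
    """In-memory sketch of the out-of-order filter: keep the most recent
    event timestamp per key, drop rows older than it, and expire state
    entries after `expiry_seconds` (as the Beam expiry timer would) so
    bad data can be re-ingested after the expiration duration."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.state = {}  # key -> (last_event_ts, wall_clock_of_last_update)

    def accept(self, key, event_ts, now=None):
        now = time.time() if now is None else now
        entry = self.state.get(key)
        # Treat expired state as absent, mimicking the timer-based cleanup.
        if entry is not None and now - entry[1] > self.expiry_seconds:
            entry = None
        if entry is not None and event_ts < entry[0]:
            return False  # out of order: drop the row
        self.state[key] = (event_ts, now)
        return True
```

A row with the same timestamp as the stored one is let through here, which allows overwriting a value with a corrected row at the same event time.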

Caveat:

  1. This will increase the memory consumption of the pipeline workers, especially if the expiry duration is long. As such, this PR cannot meet the requirement of handling out-of-order messages over long durations, especially if the throughput is high.

Alternative considered:

  1. Retrieve the timestamp stored in Redis, compare it with the incoming timestamp, then update only if necessary. This approach is difficult to batch, is RedisIO-specific, and its latency would also be higher than that of the current approach.

We would need to discuss whether the current approach is the most suitable one. Once confirmed, I can continue integrating this into the current Feast ingestion workflow.

@feast-ci-bot
Collaborator

Hi @khorshuheng. Thanks for your PR.

I'm waiting for a gojek member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@woop
Member

woop commented Oct 29, 2019

Hi @khorshuheng, thanks so much for the submission!

As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, Feast users can fix bad data by re-ingesting the same feature row with the same timestamp, as long as they do so after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key will not be stuck forever.

Can we drill into this functionality a bit: what would happen if the user was streaming in ordered data, then stopped streaming and waited until the state expired, and then started streaming in very old data? Would this be allowed through? Is that intended?

@woop
Member

woop commented Oct 29, 2019

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

@khorshuheng
Collaborator Author

Hi @khorshuheng, thanks so much for the submission!

As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, Feast users can fix bad data by re-ingesting the same feature row with the same timestamp, as long as they do so after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key will not be stuck forever.

Can we drill into this functionality a bit: what would happen if the user was streaming in ordered data, then stopped streaming and waited until the state expired, and then started streaming in very old data? Would this be allowed through? Is that intended?

Yes, this will be allowed through, and it is intended. If we need out-of-order feature rows to be handled within a larger time window, such as on the order of days, then the approach in this PR would not be feasible. Instead, we would have to store the state externally, for example in Redis, though that comes at the price of increased latency.

An alternative I can think of, which would not sacrifice latency, is to define a maximum allowable out-of-date duration, e.g. features more than 1 hour older than the ingestion time would be rejected, regardless of ingestion order. Would that fulfil the typical use cases of Feast?
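The alternative described above is stateless: a row is judged purely by how far its event timestamp lags behind the ingestion time. A minimal sketch, with `max_lag_seconds` as an illustrative parameter:

```python
def within_allowed_lag(event_ts, ingestion_ts, max_lag_seconds=3600):
    """Stateless check: reject feature rows whose event timestamp lags
    the ingestion time by more than max_lag_seconds, regardless of the
    order in which rows arrive. No per-key state is required."""
    return ingestion_ts - event_ts <= max_lag_seconds
```

Because no state is kept, this check costs nothing in memory, but it cannot distinguish a legitimately late row from an out-of-order one within the allowed window.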

@khorshuheng
Collaborator Author

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and its event timestamp will be written to the state store.

@woop
Member

woop commented Oct 29, 2019

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and its event timestamp will be written to the state store.

Does this mean that when a job is restarted we will potentially allow older keys to overwrite newer keys?

@woop
Member

woop commented Oct 29, 2019

Hi @khorshuheng, thanks so much for the submission!

As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, Feast users can fix bad data by re-ingesting the same feature row with the same timestamp, as long as they do so after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key will not be stuck forever.

Can we drill into this functionality a bit: what would happen if the user was streaming in ordered data, then stopped streaming and waited until the state expired, and then started streaming in very old data? Would this be allowed through? Is that intended?

Yes, this will be allowed through, and it is intended. If we need out-of-order feature rows to be handled within a larger time window, such as on the order of days, then the approach in this PR would not be feasible. Instead, we would have to store the state externally, for example in Redis, though that comes at the price of increased latency.

An alternative I can think of, which would not sacrifice latency, is to define a maximum allowable out-of-date duration, e.g. features more than 1 hour older than the ingestion time would be rejected, regardless of ingestion order. Would that fulfil the typical use cases of Feast?

Ok I see where you are coming from, but I still need to clarify how this will map to our use cases.

We must be able to support ingestion of data whose event time is of any age. For example, users could ingest feature rows with timestamps from 1970, and we should support that. Although when we are going back far in history (months), we typically do not have very granular data, so the key space should not be too big. However, it could still be bigger than a single node can handle, especially if that worker node is handling ingestion for multiple distinct ingestion jobs.

It seems like the expiration of keys serves mostly the purpose of allowing overwriting or "fixing" of bad data that has been ingested. Even though supporting that would be great, I would rather give up that functionality to ensure that we don't clobber new values with old values accidentally.

Storing state in Redis makes sense, but I would love to be able to get around having that as a dependency for this functionality.

@khorshuheng
Collaborator Author

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and its event timestamp will be written to the state store.

Does this mean that when a job is restarted we will potentially allow older keys to overwrite newer keys?

Good catch. I will need to check the Apache Beam documentation on whether state store values persist between job restarts. Most likely they do not.

How about if we handle this on the Redis write level? Something along the line of: https://stackoverflow.com/questions/22906778/conditional-redis-set-only-update-with-newest-version

I have not researched this much yet; I will explore a bit more to see if it is viable.
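The linked Stack Overflow approach is a conditional "set only if newer" on the Redis side, done atomically with a Lua script. The comparison logic can be emulated in plain Python, with a dict standing in for Redis so the behavior is easy to see (the function name and return convention are illustrative):

```python
def conditional_set(store, key, value, event_ts):
    """Emulation of a 'set only if newer' write. On Redis this would be
    an atomic Lua script that compares a stored timestamp before
    overwriting; here a dict stands in for the store. Returns True if
    the write happened, False if the stored value was newer."""
    existing = store.get(key)
    if existing is not None and existing[0] > event_ts:
        return False  # stored value is newer: keep it
    store[key] = (event_ts, value)
    return True
```

On real Redis the read-compare-write must run as one atomic script (or inside WATCH/MULTI), otherwise two concurrent writers could interleave between the read and the write.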

@woop
Member

woop commented Oct 30, 2019

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and its event timestamp will be written to the state store.

Does this mean that when a job is restarted we will potentially allow older keys to overwrite newer keys?

Good catch. I will need to check the Apache Beam documentation on whether state store values persist between job restarts. Most likely they do not.

How about if we handle this on the Redis write level? Something along the line of: https://stackoverflow.com/questions/22906778/conditional-redis-set-only-update-with-newest-version

I have not researched this much yet; I will explore a bit more to see if it is viable.

I think we would want to ensure that this applies to all serving stores that we implement, so I would prefer to do this at the stream level. We would also have more metrics to collect from these events if we capture them in the stream.

@woop
Member

woop commented Nov 5, 2019

/ok-to-test

@woop woop changed the base branch from 0.3-dev to master November 17, 2019 10:51
Member


Should we make this >= so that it is possible to overwrite values with the same event_time?

@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: khorshuheng
To complete the pull request process, please assign woop
You can assign the PR to them by writing /assign @woop in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@woop
Member

woop commented Dec 22, 2019

@khorshuheng just picking this up again because I think it's important to get this merged in.

  • I like the idea of not adding a new dependency. I would want to avoid handling this at the Redis level.
  • We can make this functionality optional based on the serving/store deployment's configuration. I think we should assume that this will only be enabled for online stores, not historical stores.
  • I thought quite long about whether we even need key expiration. I think we can keep it in, but as long as we set the default duration quite long (1d+), it should serve most use cases. The timestamp pool could be refreshed by batch or streaming jobs, and in theory most keys would never expire.

Would it be possible to update this PR to have:

  • Configuration to enable this at the store level
  • Integration with Redis (and possibly Cassandra, if it's merged in)

@davidheryanto
Collaborator

davidheryanto commented Dec 23, 2019

In this implementation, suppose we set the timer expiry duration to 3 days. Then, 5 days later:

  • someone uses the Python SDK to ingest old data with event timestamps of, say, 1 month ago
  • the entity field values of this old data overlap the existing ones in the serving store

The values in the serving store will be replaced with the old ones, right? Because by the time this ingestion (which happens 5 days later) runs, lastUpdatedState has been cleared.

Just want to make sure that this is ok and expected.

@woop
Member

woop commented Dec 23, 2019

In this implementation, suppose we set the timer expiry duration to 3 days. Then, 5 days later:

  • someone uses the Python SDK to ingest old data with event timestamps of, say, 1 month ago
  • the entity_id field values of this old data overlap the existing ones in the serving store

The values in the serving store will be replaced with the old ones, right? Because by the time this ingestion (which happens 5 days later) runs, lastUpdatedState has been cleared.

Just want to make sure that this is ok and expected.

Yes, that is correct.

The trade-off is that it now allows us to ingest data to replace data in the serving store when we previously ingested wrong data at a future timestamp. If we allow this configuration at the feature set or subscription level, then it can be set to "infinite" for some feature sets if needed.

Another idea is to have another flag at ingestion, like force_write, which bypasses this functionality. That would allow users to control when ordering is respected and when it is not. The problem with this approach is that it would have no ordering guarantees, even within a single ingest, so users would probably have to ingest with force_write and then do another ingest without it.
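The force_write idea above amounts to a bypass around the ordering check; a minimal sketch (the function name and wiring are illustrative, only the flag name comes from the comment):

```python
def should_write(last_seen_ts, event_ts, force_write=False):
    """Decide whether a feature row is written to the store.
    force_write bypasses the ordering check entirely, which is why a
    forced ingest has no ordering guarantees even within itself: every
    row is written in arrival order, newest-wins no longer holds."""
    if force_write:
        return True
    # No prior state means the row is trivially up to date.
    return last_seen_ts is None or event_ts >= last_seen_ts
```

This is why the comment suggests a two-pass workflow: one ingest with force_write to overwrite the bad data, then a normal ingest to restore ordering guarantees going forward.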

@woop woop added this to the v0.5.0 milestone Mar 9, 2020
@woop woop added kind/techdebt priority/p0 Highest priority labels Mar 9, 2020
@woop woop mentioned this pull request Mar 9, 2020
@khorshuheng khorshuheng force-pushed the filter-out-of-order branch from 2f00898 to 7f4e382 Compare March 12, 2020 01:55
@feast-ci-bot
Collaborator

@khorshuheng: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
test-core-and-ingestion 7f4e382 link /test test-core-and-ingestion
test-end-to-end 7f4e382 link /test test-end-to-end
test-end-to-end-batch 7f4e382 link /test test-end-to-end-batch

Full PR test history

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@khorshuheng
Collaborator Author

We tested the current approach with a dataset of 600k unique keys (hence none of the rows would be filtered away for this particular dataset), using both the Direct Runner and the Dataflow Runner. In both cases, the inclusion of the state store did not result in a significant memory increase. However, the latency increases as the number of keys in the state store grows.

In our use case, Redis ingestion throughput dropped by more than half.

@woop
Member

woop commented Apr 29, 2020

@khorshuheng should we perhaps close this PR for the time being?

@woop woop removed this from the v0.5.0 milestone Apr 29, 2020
@woop woop removed the priority/p0 Highest priority label Apr 29, 2020
@woop woop closed this May 8, 2020
4 participants