Filter out of order feature rows#273

Closed
khorshuheng wants to merge 1 commit intofeast-dev:masterfrom
khorshuheng:filter-out-of-order

Conversation

@khorshuheng
Collaborator

This is still a work in progress.

The purpose of this PR is to ensure that out-of-order feature rows do not overwrite the existing, more recent values stored in Redis.

To achieve this, this PR defines a new transformation which filters out out-of-order feature rows based on their event timestamp. The events are first grouped by key (similar to the Redis key), and the most recent event timestamp per key is stored and updated using StateId.

As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, Feast users can fix bad data by re-ingesting the same feature row with the same timestamp, as long as they do so after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key will not be stuck forever.

This approach is also extensible to non-Redis IO pipelines.
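The actual transform is built on Apache Beam's state and timer APIs in Java, but the core filtering logic can be sketched in plain Python. The class name `OrderedFilter` and its parameters are illustrative, not from this PR:

```python
import time

class OrderedFilter:
    """In-memory sketch of the out-of-order filter: keep the most recent
    event timestamp per key, drop rows older than it, and expire state
    entries after `expiry_seconds` (as the Beam expiry timer would) so
    bad data can be re-ingested after the expiration duration."""

    def __init__(self, expiry_seconds):
        self.expiry_seconds = expiry_seconds
        self.state = {}  # key -> (last_event_ts, wall_clock_of_last_update)

    def accept(self, key, event_ts, now=None):
        now = time.time() if now is None else now
        entry = self.state.get(key)
        # Treat expired state as absent, mimicking the timer-based cleanup.
        if entry is not None and now - entry[1] > self.expiry_seconds:
            entry = None
        if entry is not None and event_ts < entry[0]:
            return False  # out of order: drop the row
        self.state[key] = (event_ts, now)
        return True
```

A row with the same timestamp as the stored one is let through here, which allows overwriting a value with a corrected row at the same event time.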

Caveat:

  1. This will increase the memory consumption of the pipeline workers, especially if the expiry duration is long. As such, this PR cannot meet the requirement of handling out-of-order messages over long durations, especially if the throughput is high.

Alternative considered:

  1. Retrieve the timestamp stored in Redis, compare it with the incoming timestamp, then update only if necessary. This approach is difficult to batch, is RedisIO-specific, and its latency would also be higher than that of the current approach.

We would need to discuss whether the current approach is the most suitable one. Once confirmed, I can continue integrating this into the current Feast ingestion workflow.

@feast-ci-bot
Collaborator

Hi @khorshuheng. Thanks for your PR.

I'm waiting for a gojek member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@woop
Member

woop commented Oct 29, 2019

Hi @khorshuheng, thanks so much for the submission!

As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, Feast users can fix bad data by re-ingesting the same feature row with the same timestamp, as long as they do so after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key will not be stuck forever.

Can we drill into this functionality a bit: what would happen if the user was streaming in ordered data, then stopped streaming and waited until the state expired, and then started streaming in very old data? Would this be allowed through? Is that intended?

@woop
Member

woop commented Oct 29, 2019

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

@khorshuheng
Collaborator Author

Hi @khorshuheng, thanks so much for the submission!

As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, Feast users can fix bad data by re-ingesting the same feature row with the same timestamp, as long as they do so after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key will not be stuck forever.

Can we drill into this functionality a bit: what would happen if the user was streaming in ordered data, then stopped streaming and waited until the state expired, and then started streaming in very old data? Would this be allowed through? Is that intended?

Yes, this will be allowed through, and it is intended. If we need out-of-order feature rows to be handled within a larger time window, such as on the order of days, then the approach in this PR would not be feasible. Instead, we would have to store the state externally, for example in Redis, though that comes at the price of increased latency.

An alternative I can think of, which would not sacrifice latency, is to define a maximum allowable out-of-date duration, e.g. features more than 1 hour older than the ingestion time would be rejected, regardless of ingestion order. Would that fulfil the typical use cases of Feast?
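The alternative described above is stateless: a row is judged purely by how far its event timestamp lags behind the ingestion time. A minimal sketch, with `max_lag_seconds` as an illustrative parameter:

```python
def within_allowed_lag(event_ts, ingestion_ts, max_lag_seconds=3600):
    """Stateless check: reject feature rows whose event timestamp lags
    the ingestion time by more than max_lag_seconds, regardless of the
    order in which rows arrive. No per-key state is required."""
    return ingestion_ts - event_ts <= max_lag_seconds
```

Because no state is kept, this check costs nothing in memory, but it cannot distinguish a legitimately late row from an out-of-order one within the allowed window.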

@khorshuheng
Collaborator Author

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and its event timestamp will be written to the state store.

@woop
Member

woop commented Oct 29, 2019

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and its event timestamp will be written to the state store.

Does this mean that when a job is restarted we will potentially allow older keys to overwrite newer keys?

@woop
Member

woop commented Oct 29, 2019

Hi @khorshuheng, thanks so much for the submission!

As the state is stored in memory, an expiry timer has also been defined to clear the state store after a predefined duration. As a side effect, Feast users can fix bad data by re-ingesting the same feature row with the same timestamp, as long as they do so after the expiration duration. This also means that if a user misconfigures the event time on a feature row and writes a value far into the future, the key will not be stuck forever.

Can we drill into this functionality a bit: what would happen if the user was streaming in ordered data, then stopped streaming and waited until the state expired, and then started streaming in very old data? Would this be allowed through? Is that intended?

Yes, this will be allowed through, and it is intended. If we need out-of-order feature rows to be handled within a larger time window, such as on the order of days, then the approach in this PR would not be feasible. Instead, we would have to store the state externally, for example in Redis, though that comes at the price of increased latency.

An alternative I can think of, which would not sacrifice latency, is to define a maximum allowable out-of-date duration, e.g. features more than 1 hour older than the ingestion time would be rejected, regardless of ingestion order. Would that fulfil the typical use cases of Feast?

Ok I see where you are coming from, but I still need to clarify how this will map to our use cases.

We must be able to support ingestion of data whose event time is of any age. For example, users could ingest feature rows with timestamps from 1970, and we should support that. Although when we are going back far in history (months), we typically do not have very granular data, so the key space should not be too big. However, it could still be bigger than a single node can handle, especially if that worker node is handling ingestion for multiple distinct ingestion jobs.

It seems like the expiration of keys serves mostly the purpose of allowing overwriting or "fixing" of bad data that has been ingested. Even though supporting that would be great, I would rather give up that functionality to ensure that we don't clobber new values with old values accidentally.

Storing state in Redis makes sense, but I would love to be able to get around having that as a dependency for this functionality.

@khorshuheng
Collaborator Author

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and its event timestamp will be written to the state store.

Does this mean that when a job is restarted we will potentially allow older keys to overwrite newer keys?

Good catch. I will need to check the Apache Beam documentation on whether state store values persist between job restarts. Most likely they do not.

How about if we handle this on the Redis write level? Something along the line of: https://stackoverflow.com/questions/22906778/conditional-redis-set-only-update-with-newest-version

I have not researched this much yet; I will explore a bit more to see if it is viable.
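The linked Stack Overflow approach is a conditional "set only if newer" on the Redis side, done atomically with a Lua script. The comparison logic can be emulated in plain Python, with a dict standing in for Redis so the behavior is easy to see (the function name and return convention are illustrative):

```python
def conditional_set(store, key, value, event_ts):
    """Emulation of a 'set only if newer' write. On Redis this would be
    an atomic Lua script that compares a stored timestamp before
    overwriting; here a dict stands in for the store. Returns True if
    the write happened, False if the stored value was newer."""
    existing = store.get(key)
    if existing is not None and existing[0] > event_ts:
        return False  # stored value is newer: keep it
    store[key] = (event_ts, value)
    return True
```

On real Redis the read-compare-write must run as one atomic script (or inside WATCH/MULTI), otherwise two concurrent writers could interleave between the read and the write.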

@woop
Member

woop commented Oct 30, 2019

@khorshuheng How would you want to handle initializing the startup state for this system if no LAST_UPDATED time has been set, and does it matter?

It doesn't matter if the LAST_UPDATED time has not been set. If the current state is null, the feature row will be considered up to date, and its event timestamp will be written to the state store.

Does this mean that when a job is restarted we will potentially allow older keys to overwrite newer keys?

Good catch. I will need to check the Apache Beam documentation on whether state store values persist between job restarts. Most likely they do not.

How about if we handle this on the Redis write level? Something along the line of: https://stackoverflow.com/questions/22906778/conditional-redis-set-only-update-with-newest-version

I have not researched this much yet; I will explore a bit more to see if it is viable.

I think we would want to ensure that this applies to all serving stores that we implement, so I would prefer to do this at the stream level. We would also have more metrics to collect from these events if we capture them in the stream.

@woop
Member

woop commented Nov 5, 2019

/ok-to-test

@woop woop changed the base branch from 0.3-dev to master November 17, 2019 10:51
Member


Should we make this >= so that it is possible to overwrite values with the same event_time?

@feast-ci-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: khorshuheng
To complete the pull request process, please assign woop
You can assign the PR to them by writing /assign @woop in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@woop
Member

woop commented Dec 22, 2019

@khorshuheng just picking this up again because I think it's important to get this merged in.

  • I like the idea of not adding a new dependency. I would want to avoid handling this at the Redis level.
  • We can make this functionality optional based on the serving/store deployment's configuration. I think we should assume that this will only be enabled for online stores, not historical stores.
  • I thought quite long about whether we even need key expiration. I think we can keep it in, but as long as we set the default duration quite long (1d+), it should serve most use cases. The timestamp pool could be refreshed by batch or streaming jobs, and in theory most keys would never expire.

Would it be possible to update this PR to have:

  • Configuration to enable this at the store level
  • Integration with Redis (and possibly Cassandra, if it's merged in)

@davidheryanto
Collaborator

davidheryanto commented Dec 23, 2019

In this implementation, suppose we set the timer expiry duration to 3 days. Then, 5 days later:

  • someone uses the Python SDK to ingest old data with event timestamps of, say, 1 month ago
  • the entity field values of this old data overlap the existing ones in the serving store

The values in the serving store will be replaced with the old ones, right? Because by the time this ingestion (which happens 5 days later) runs, lastUpdatedState has been cleared.

Just want to make sure that this is ok and expected.

@woop
Member

woop commented Dec 23, 2019

In this implementation, suppose we set the timer expiry duration to 3 days. Then, 5 days later:

  • someone uses the Python SDK to ingest old data with event timestamps of, say, 1 month ago
  • the entity_id field values of this old data overlap the existing ones in the serving store

The values in the serving store will be replaced with the old ones, right? Because by the time this ingestion (which happens 5 days later) runs, lastUpdatedState has been cleared.

Just want to make sure that this is ok and expected.

Yes, that is correct.

The trade-off is that it now allows us to ingest data to replace data in the serving store when we previously ingested wrong data at a future timestamp. If we allow this configuration at the feature set or subscription level, then it can be set to "infinite" for some feature sets if needed.

Another idea is to have another flag at ingestion, like force_write, which bypasses this functionality. That would allow users to control when ordering is respected and when it is not. The problem with this approach is that it would have no ordering guarantees, even within a single ingest, so users would probably have to ingest with force_write and then do another ingest without it.
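The force_write idea above amounts to a bypass around the ordering check; a minimal sketch (the function name and wiring are illustrative, only the flag name comes from the comment):

```python
def should_write(last_seen_ts, event_ts, force_write=False):
    """Decide whether a feature row is written to the store.
    force_write bypasses the ordering check entirely, which is why a
    forced ingest has no ordering guarantees even within itself: every
    row is written in arrival order, newest-wins no longer holds."""
    if force_write:
        return True
    # No prior state means the row is trivially up to date.
    return last_seen_ts is None or event_ts >= last_seen_ts
```

This is why the comment suggests a two-pass workflow: one ingest with force_write to overwrite the bad data, then a normal ingest to restore ordering guarantees going forward.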

@woop woop added this to the v0.5.0 milestone Mar 9, 2020
@woop woop added kind/techdebt priority/p0 Highest priority labels Mar 9, 2020
@woop woop mentioned this pull request Mar 9, 2020
@khorshuheng khorshuheng force-pushed the filter-out-of-order branch from 2f00898 to 7f4e382 Compare March 12, 2020 01:55
@feast-ci-bot
Collaborator

@khorshuheng: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
test-core-and-ingestion 7f4e382 link /test test-core-and-ingestion
test-end-to-end 7f4e382 link /test test-end-to-end
test-end-to-end-batch 7f4e382 link /test test-end-to-end-batch

Full PR test history

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@khorshuheng
Collaborator Author

We tested the current approach with a dataset of 600k unique keys (hence none of the rows would be filtered away for this particular dataset), using both the Direct Runner and the Dataflow Runner. In both cases, the inclusion of the state store did not result in a significant memory increase. However, the latency increases as the number of keys in the state store grows.

In our use case, Redis ingestion throughput dropped by more than half.

@woop
Member

woop commented Apr 29, 2020

@khorshuheng should we perhaps close this PR for the time being?

@woop woop removed this from the v0.5.0 milestone Apr 29, 2020
@woop woop removed the priority/p0 Highest priority label Apr 29, 2020
@woop woop closed this May 8, 2020
4 participants