Encode feature row before storing in Redis #530
feast-ci-bot merged 2 commits into feast-dev:master
Conversation
ingestion/src/main/java/feast/store/serving/redis/FeatureRowToRedisMutationDoFn.java
```java
private byte[] getValue(FeatureRow featureRow) {
  FeatureRowEncoder encoder =
      new FeatureRowEncoder(featureSets.get(featureRow.getFeatureSet()).getSpec());
  return encoder.encode(featureRow).toByteArray();
}
```
Would this make more sense as a static method encode(featureRow, featureSet)?
Yup, a static method does make more sense, unless we need to support multiple encoding mechanisms, which is unlikely.
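A static form of the helper might look like the sketch below. The types here are simplified stand-ins for the Feast protos, not the real `FeatureRow`/`FeatureSetSpec` classes:

```java
import java.util.*;

// Sketch of the static-method refactor suggested above:
// encode(featureRow, featureSet) instead of constructing an encoder per
// row. The record types are simplified stand-ins for the Feast protos.
public class StaticEncodeSketch {
    record FeatureRow(Map<String, String> fields) {}
    record FeatureSetSpec(List<String> featureNames) {}

    // Static and stateless: no instance is needed unless multiple
    // encoding mechanisms must be supported later.
    static List<String> encode(FeatureRow row, FeatureSetSpec spec) {
        List<String> values = new ArrayList<>();
        for (String name : spec.featureNames()) {
            values.add(row.fields().get(name)); // names dropped, spec order kept
        }
        return values;
    }

    public static void main(String[] args) {
        FeatureRow row = new FeatureRow(Map.of("age", "30", "income", "900"));
        FeatureSetSpec spec = new FeatureSetSpec(List.of("age", "income"));
        System.out.println(encode(row, spec)); // [30, 900]
    }
}
```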
```java
/**
 * @return boolean
 */
public Boolean isEncodingValid(FeatureRow featureRow) {
  return featureRow.getFieldsList().size() == spec.getFeaturesList().size();
}
```
If you discover an old feature row (that is still within max age), how will you interpret it? Are we using the versions here to be able to interpret old feature rows?
The Redis key contains both the feature set name and version. Since both ingestion and serving encode/decode based on the feature set name and version specified in the Redis key, the spec should be consistent, and therefore isEncodingValid will always return true.
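The key-based spec resolution described above could be sketched as follows (the key layout here is illustrative, not Feast's actual Redis key format):

```java
// Sketch of how the Redis key pins the encoding: the key embeds the
// feature set name and version, so both ingestion and serving resolve
// the same spec and therefore agree on the field ordering used to
// encode and decode. Names here are illustrative, not the real format.
public class RedisKeySketch {
    static String redisKey(String featureSetName, int version, String entityKey) {
        return featureSetName + ":" + version + ":" + entityKey;
    }

    public static void main(String[] args) {
        System.out.println(redisKey("customer_features", 2, "customer_id=42"));
    }
}
```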
Originally I added this method to partially address the concern raised in #515:

> If field values are only going to be associated with field names at runtime by external configuration has any thought been given to a method for ensuring that the same configuration that was used to write the data is the configuration used to read the data? Something such as a checksum/fingerprint of the FeatureSet configuration stored alongside the data in Redis (or in the key) will help prevent a mismatch of configuration and data due to a bug somewhere else in the system.

Perhaps I should remove this method from this PR?
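The fingerprint idea quoted from #515 could be sketched like this (a hypothetical helper, not part of the Feast codebase):

```java
import java.security.MessageDigest;
import java.util.*;

// Sketch of the checksum idea from #515: fingerprint the feature set's
// field names (sorted) and store it alongside the encoded row, so a
// reader can detect a spec mismatch. Hypothetical helper, not Feast API.
public class SpecFingerprintSketch {
    static String fingerprint(List<String> featureNames) throws Exception {
        List<String> sorted = new ArrayList<>(featureNames);
        Collections.sort(sorted);
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(String.join(",", sorted).getBytes(java.nio.charset.StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // The same set of names in any order yields the same fingerprint.
        String a = fingerprint(List.of("age", "income"));
        String b = fingerprint(List.of("income", "age"));
        System.out.println(a.equals(b)); // true
    }
}
```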
For this thread and the PR as a whole, especially the part quoted below: does this approach fall down and need to be reworked as versions get removed?
In cases where the Feature Rows come from an external Kafka source, where it is possible for a feature row to be malformed, there are a few checks in place:
- It's fine for the feature row to have feature fields in a different order than that specified in the feature set spec, as the encoding process will sort the features in alphabetical order based on name.
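The order invariance described in this check can be sketched as follows (a simplified stand-in for the real encoder, not Feast's actual implementation):

```java
import java.util.*;

// Sketch of the ordering guarantee above: fields arriving in any order
// encode to the same value sequence, because encoding sorts by field
// name first (a TreeMap iterates its keys in sorted order).
public class OrderInvarianceSketch {
    static List<String> encodeValues(Map<String, String> fields) {
        return new ArrayList<>(new TreeMap<>(fields).values());
    }

    public static void main(String[] args) {
        Map<String, String> inOrder = new LinkedHashMap<>();
        inOrder.put("a_feat", "1");
        inOrder.put("b_feat", "2");
        Map<String, String> reversed = new LinkedHashMap<>();
        reversed.put("b_feat", "2");
        reversed.put("a_feat", "1");
        // Same encoding regardless of arrival order.
        System.out.println(encodeValues(inOrder).equals(encodeValues(reversed))); // true
    }
}
```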
Oh yeah, @zhilingc has brought this up in a comment thread on her versions removal RFC. Ouch 🤕
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: khorshuheng, zhilingc.
* Encode feature row before storing in Redis
* Include encoding as part of RedisMutationDoFn

Co-authored-by: Khor Shu Heng <khor.heng@gojek.com>
What this PR does / why we need it:
Currently, feature names are stored together with the values in Redis, which results in an excessive memory footprint when feature names are long strings.
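To make the overhead concrete, a rough back-of-the-envelope sketch (the numbers are illustrative, not measured against the actual proto encoding):

```java
// Rough illustration of the footprint claim above: with long feature
// names, the name bytes can dominate the value bytes for each stored
// field. Illustrative only, not Feast's actual serialized sizes.
public class FootprintSketch {
    public static void main(String[] args) {
        String name = "driver_trip_completed_conversion_rate_7d"; // 40 chars
        String value = "0.87";
        int withNames = name.length() + value.length();
        int valuesOnly = value.length();
        System.out.println(withNames + " vs " + valuesOnly + " bytes per field");
    }
}
```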
This PR encodes the Feature Row prior to storing it in Redis. Encoding strategy:
In cases where the Feature Rows come from an external Kafka source, where it is possible for a feature row to be malformed, there are a few checks in place:
Good to have, but out of scope for this PR:
Which issue(s) this PR fixes:
Fixes #515.
Does this PR introduce a user-facing change?:
After this patch, the ingestion job will start storing feature rows in an encoded format. However, the serving job is still able to read non-encoded feature rows, so it is not necessary to re-ingest data written prior to applying the patch, unless the user wishes to reduce the memory footprint of the existing keys.
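The fallback behaviour described above could be sketched as follows (the format-detection heuristic here is hypothetical, not Feast's actual decoding logic):

```java
import java.util.*;

// Sketch of the backward compatibility described above: serving can
// detect whether a stored row is in the new encoded format (values
// only) or the old format (name/value pairs) and decode accordingly.
// The "contains '='" check is a hypothetical stand-in for the real test.
public class DecodeFallbackSketch {
    static Map<String, String> decode(List<String> stored, List<String> specNames) {
        Map<String, String> out = new LinkedHashMap<>();
        boolean legacy = !stored.isEmpty() && stored.get(0).contains("=");
        if (legacy) {
            // Old rows carry "name=value" entries.
            for (String entry : stored) {
                String[] kv = entry.split("=", 2);
                out.put(kv[0], kv[1]);
            }
        } else {
            // New rows carry bare values in spec order.
            for (int i = 0; i < specNames.size(); i++) {
                out.put(specNames.get(i), stored.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> spec = List.of("age", "income");
        Map<String, String> fromLegacy = decode(List.of("age=30", "income=900"), spec);
        Map<String, String> fromEncoded = decode(List.of("30", "900"), spec);
        System.out.println(fromLegacy.equals(fromEncoded)); // true
    }
}
```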