Description
Expected Behavior
In BigQueryRetrievalJob, when I call to_remote_storage(), I would expect the return value to be the paths of the parquet files that were written to GCS by this call...
Current Behavior
...however, it turns out that the paths returned are those of every parquet file ever written to the bucket that you are using to store these parquets.
For example, say you set gcs_staging_location in your feature_store.yaml to feast-materialize-dev and project_id to my_feature_store. Then self._gcs_path, as defined here, will be gs://feast-materialize-dev/my_feature_store/export/ff67c43e-7174-475f-a02c-6c7587d89731 (or some other UUID string, but you get the idea). However, the rest of the code in the to_remote_storage method returns all paths under gs://feast-materialize-dev/my_feature_store/export, which is not what we want, as the parquets for this call are written under self._gcs_path.
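The prefix derivation described above can be sketched in a few lines (the path is the illustrative one from this example, not taken from a real bucket):

```python
# How to_remote_storage derives the listing prefix from self._gcs_path
# (illustrative path from the example above).
gcs_path = "gs://feast-materialize-dev/my_feature_store/export/ff67c43e-7174-475f-a02c-6c7587d89731"

# Split "gs://bucket/key..." into bucket name and object prefix.
bucket, prefix = gcs_path[len("gs://"):].split("/", 1)
assert bucket == "feast-materialize-dev"
# The full prefix points at this job's own export directory:
assert prefix == "my_feature_store/export/ff67c43e-7174-475f-a02c-6c7587d89731"

# The offending rsplit drops the per-job UUID segment, widening the
# prefix to the shared export directory, so a listing with it matches
# every previous export as well:
widened = prefix.rsplit("/", 1)[0]
assert widened == "my_feature_store/export"
```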
Steps to reproduce
You can see that the code is wrong with a simple example:
Current code (pretty much from this). In this example you might imagine there are parquets created by the to_remote_storage call under gs://feast-materialize-dev/my_feature_store/export/19a1c772-1f91-44da-8486-ea476f027d93/, but from a previous call there are also some at gs://feast-materialize-dev/my_feature_store/export/e00597db-78d5-40e1-b125-eac903802acd/:
>>> from google.cloud.storage import Client as StorageClient
>>> _gcs_path = "gs://feast-materialize-dev/my_feature_store/export/ff67c43e-7174-475f-a02c-6c7587d89731"
>>> bucket, prefix = _gcs_path[len("gs://") :].split("/", 1)
>>> bucket
'feast-materialize-dev'
>>> prefix
'my_feature_store/export/ff67c43e-7174-475f-a02c-6c7587d89731'
>>> prefix = prefix.rsplit("/", 1)[0] # THIS IS THE LINE THAT WE DO NOT WANT
>>> prefix
'my_feature_store/export'
>>> if prefix.startswith("/"):
...     prefix = prefix[1:]
...
>>> prefix
'my_feature_store/export'
>>> storage_client = StorageClient()
>>> blobs = storage_client.list_blobs(bucket, prefix=prefix)
>>> results = []
>>> for b in blobs:
...     results.append(f"gs://{b.bucket.name}/{b.name}")
...
>>> print(results)
["gs://feast-materialize-dev/my_feature_store/export/19a1c772-1f91-44da-8486-ea476f027d93/000000000000.parquet", "gs://feast-materialize-dev/my_feature_store/export/19a1c772-1f91-44da-8486-ea476f027d93/000000000001.parquet", "gs://feast-materialize-dev/my_feature_store/export/e00597db-78d5-40e1-b125-eac903802acd/000000000000.parquet", "gs://feast-materialize-dev/my_feature_store/export/e00597db-78d5-40e1-b125-eac903802acd/000000000001.parquet"] You can see in this example, there are parquets paths returned that are not [art of the self._gcs_path and therefore the write to gcs that occurred in this call. This is not what i would expect.
Possible Solution
The corrected code would simply omit the line prefix = prefix.rsplit("/", 1)[0].
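A minimal sketch of the fix (the helper name parse_gcs_path is my own, purely for illustration; in the actual method this logic is inline):

```python
def parse_gcs_path(gcs_path: str) -> tuple:
    """Split a gs:// URI into (bucket, object prefix), keeping the full prefix."""
    bucket, prefix = gcs_path[len("gs://"):].split("/", 1)
    # Note: no prefix.rsplit("/", 1)[0] here. Keeping the full per-job
    # path means list_blobs(bucket, prefix=prefix) only matches blobs
    # written under self._gcs_path by this call.
    if prefix.startswith("/"):
        prefix = prefix[1:]
    return bucket, prefix

bucket, prefix = parse_gcs_path(
    "gs://feast-materialize-dev/my_feature_store/export/ff67c43e-7174-475f-a02c-6c7587d89731"
)
# bucket and prefix would then be passed unchanged to
# storage_client.list_blobs(bucket, prefix=prefix).
```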